Finding the Median value using SQL

Here is a simple method of finding the median value using SQL. The first will be a simple example that finds the median number of units sold for a given book (ISBN) over a period of time. There will be 3 sections of code (that will be combined using CTEs). First we need to find the median number, that is for a given book, how many times was it sold. The sales_data table contains 3 columns: ISBN, SaleDate, UnitsSold

select ISBN,
       ceiling(count(*)/2.0) as median
  from sales_data
 group by ISBN

The use of ceiling in the above equation accounts for the fact that if we have only 1 sale for an ISBN, then 1/2=0, whereas ceiling(1/2.0)=1. The need for the median to start at 1 and not 0 will become evident in the text below.

select ISBN,
       Units_Sold,
       ROW_NUMBER() OVER(Partition by ISBN Order by Units_Sold) as r_num
  from sales_data

Since we want to know the median of UnitsSold we must Order by UnitsSold. Using the rownumber() function we get a set of numbers, starting at 1 (here is the need for ceiling above) and continuing until we get a new ISBN. Once we get a new ISBN, the numbering restarts at 1. Let’s say that our test data looks like this:

ISBN Units_Sold r_num

9780000000001 10 1

9780000000001 13 2

9780000000001 20 3

Our median will be ceiling(3/2.0)=2. That means the median value for Units_Sold for ISBN=9780000000001 is 13. The full code is below. Please note I create this as a view so there is no need to be a ; before the first with.

Create view MedianUnitsSold
as
with base
as
(
select ISBN,
       ceiling(count(*)/2.0) as median
  from sales_data
 group by ISBN
),
population
as
(
select ISBN,
       Units_Sold,
       ROW_NUMBER() OVER(Partition by ISBN Order by Units_Sold) as r_num
  from sales_data
)
select p.ISBN,
       p.Units_Sold
  from population p
  join base b on (b.ISBN=p.ISBN
       and b.median=p.r_num)

As always, I hope this post has been able to help you solve a problem that you’ve had or provided you with some new insight. Feel free to post comments below.

Auto-increment without the need of a table.

I came across a situation where I need unique numbers for both a large and small set of data. The standard approach is to create a table with an identity column and do your inserts. For a table with tens of thousands of records, this might be the best approach, but there is another one: ROW_NUMBER().

The key to using the ROW_NUMBER() function is to provide a constant so you will get a contiguous sequence.  Let's look at sample data.

Format
  AB
  AL
  CD
  CE
  GG
  LI

Let's look at the function:

ROW_NUMBER() OVER (Partition by <group> order by <columns>)

I can first construct the function as follows

 select Format,
         ROW_NUMBER() OVER(PARTITION BY Format ORDER BY FORMAT) as id

However this will give us the following results:

id Format
 1   AB
 1   AL
 1   CD
 1   CE
 1   GG
 1   LI

The reason is your PARTITION BY creates a unique number for each format. To solve this problem you need to put a value that will never change over the life of the select. I use 1.

 select Format,
         ROW_NUMBER() OVER(PARTITION BY 1 ORDER BY FORMAT) as id

The new result set is:

id Format
 1   AB
 2   AL
 3   CD
 4   CE
 5   GG
 6   LI

Hope this helps improve your sql. As always, leave comments below.

 

 

 

 

Using Row_Number to pick the correct entry.

There have been numerous occasions where I have to pick a one of two records out of a given result set, but how I determine which one has some unique rules.  I have two examples of this and I show how I used Row_Number() to solve this.

The first set of data has these sets of rules:

  • For any given ISBN13, give me the 'Acquisition Editor' if they exist and if they do not exist, provide the 'Editor', known as [Role].
  • If there are multiple persons for the chosen [Role], pick the one with the lowest [SortOrder].
  • If [SortOrder] is null, pick the first name alphabetically.

Here is the solution:

select ISBN13,
       DisplayName,
       [Role]
  from (select ISBN13,
               DisplayName,
               [Role],
               ROW_NUMBER() OVER(Partition by ISBN13 order by [Role],isnull(cast(SortOrder as varchar(40)),DisplayName)) as Row
          from dbo.ProductContact
          where C006_Role in ('Acquisition Editor','Editor')) d
where Row=1

The second set of data had the following rules:

  • For a given ISBN13, provide the sum of the  'First Printing', which can be under either PrintNumber '0000' or '0001'. 
  • PrintNumber '0000' is always chosen over '0001' if it exists.

Here is the solution:

select isbn13,
       Printing_Number,
       First_Run
   from (SELECT isbn13,
                Printing_Number,
                Sum(Print_Run) as First_Run,                
                ROW_NUMBER() OVER(PARTITION by isbn13 order by Printing_Number) as Row
           FROM dbo.PPB
          where Printing_Number in ('0000','0001')
          group by isbn13,Printing_Number) d
  where Row=1