Thread by Ben Collins
- Tweet
- Jul 19, 2022
- #ComputerScience #CareerDevelopment
Thread
Today's #GoogleSheets tip wasn't ever meant to be a tip. I was just trying to scrape some data for another tip...
But creating the formula was so interesting (and, dare I say, fun! ๐) that I decided to share it instead.
Buckle up... We're in for a wild ride!
But creating the formula was so interesting (and, dare I say, fun! ๐) that I decided to share it instead.
Buckle up... We're in for a wild ride!
Here's the problem we're going to solve with today's tip โฌ๏ธ
We want to scrape the population data from this table in Wikipedia...
We want to scrape the population data from this table in Wikipedia...
... and put it into a Google Sheet like this, so we can use it for analysis.
The challenge today is to do this with a single formula!
Ready? Let's jump in!
The challenge today is to do this with a single formula!
Ready? Let's jump in!
1๏ธโฃ Open a new Google Sheet.
2๏ธโฃ In cell A1, insert a basic IMPORTHTML formula to scrape the raw table of population data from Wikipedia:
=IMPORTHTML("en.wikipedia.org/wiki/List_of_cities_proper_by_population","table",2)
The data has some issues, but it's a start.
2๏ธโฃ In cell A1, insert a basic IMPORTHTML formula to scrape the raw table of population data from Wikipedia:
=IMPORTHTML("en.wikipedia.org/wiki/List_of_cities_proper_by_population","table",2)
The data has some issues, but it's a start.
3๏ธโฃ Pick the columns we want, by wrapping the Import formula with a Query formula.
(Use Col1 notation in our Select statement, not the column letter.)
=QUERY(IMPORTHTML("en.wikipedia.org/wiki/List_of_cities_proper_by_population","table",2),"select Col2, Col8, Col4",1)
(Use Col1 notation in our Select statement, not the column letter.)
=QUERY(IMPORTHTML("en.wikipedia.org/wiki/List_of_cities_proper_by_population","table",2),"select Col2, Col8, Col4",1)
4๏ธโฃ Hmm, that population column is messed up! Regex to the rescue! โ๏ธ
At this point, we'll deal with the population column on its own and come back to our main formula later...
At this point, we'll deal with the population column on its own and come back to our main formula later...
So make a copy of this Import formula in cell F1 and change it to:
=QUERY(IMPORTHTML("en.wikipedia.org/wiki/List_of_cities_proper_by_population","table",2),"select Col4",1)
Now we should have a copy of just the population column in column F.
=QUERY(IMPORTHTML("en.wikipedia.org/wiki/List_of_cities_proper_by_population","table",2),"select Col4",1)
Now we should have a copy of just the population column in column F.
5๏ธโฃ Wrap this population only formula in cell F1 with a Regex formula:
=REGEXEXTRACT(QUERY(IMPORTHTML("en.wikipedia.org/wiki/List_of_cities_proper_by_population","table",2),"select Col4",1),"โ ([0-9,]*)")
Hmm, that gives us a #N/A error... ๐ค
=REGEXEXTRACT(QUERY(IMPORTHTML("en.wikipedia.org/wiki/List_of_cities_proper_by_population","table",2),"select Col4",1),"โ ([0-9,]*)")
Hmm, that gives us a #N/A error... ๐ค
6๏ธโฃ Turn this into an Array Formula and get the column of population numbers!
=ARRAYFORMULA(REGEXEXTRACT(QUERY(IMPORTHTML("en.wikipedia.org/wiki/List_of_cities_proper_by_population","table",2),"select Col4",1),"โ ([0-9,]*)"))
=ARRAYFORMULA(REGEXEXTRACT(QUERY(IMPORTHTML("en.wikipedia.org/wiki/List_of_cities_proper_by_population","table",2),"select Col4",1),"โ ([0-9,]*)"))
The Regex formula uses this expression "โ ([0-9,]*)" to extract the numbers after the funny โ symbol.
We still have two problems to solve though:
๐น we need to convert the strings into actual numbers and
๐น fix the #N/A column heading
...
We still have two problems to solve though:
๐น we need to convert the strings into actual numbers and
๐น fix the #N/A column heading
...
7๏ธโฃ Use the SUBSTITUTE function to remove the commas and convert the strings into numbers:
=ARRAYFORMULA(SUBSTITUTE(REGEXEXTRACT(QUERY(IMPORTHTML("en.wikipedia.org/wiki/List_of_cities_proper_by_population","table",2),"select Col4",1),"โ ([0-9,]*)"),",","")*1)
=ARRAYFORMULA(SUBSTITUTE(REGEXEXTRACT(QUERY(IMPORTHTML("en.wikipedia.org/wiki/List_of_cities_proper_by_population","table",2),"select Col4",1),"โ ([0-9,]*)"),",","")*1)
(The multiplication by 1 at the very end coerces the strings into numbers after the commas have been removed.)
8๏ธโฃ Use the IFERROR function to fix that pesky #N/A error at the top of our column heading, and replace the #N/A with the word "Population"...
=ARRAYFORMULA(IFERROR(SUBSTITUTE(REGEXEXTRACT(QUERY(IMPORTHTML("en.wikipedia.org/wiki/List_of_cities_proper_by_population","table",2),"select Col4",1),"โ ([0-9,]*)"),",","")*1,"Population"))
Nice, now we have our population column as numbers.
Nice, now we have our population column as numbers.
9๏ธโฃ Go back to the main formula in A1 and remove the old population column (with the funny numbers). Our formula in A1 should now be:
=QUERY(IMPORTHTML("en.wikipedia.org/wiki/List_of_cities_proper_by_population","table",2),"select Col2, Col8",1)
=QUERY(IMPORTHTML("en.wikipedia.org/wiki/List_of_cities_proper_by_population","table",2),"select Col2, Col8",1)
๐ All that's left is to join these two ranges, in columns A, B and F, using the curly bracket notation.
This formula is too big to fit in one tweet, so I'm going to have to split it. Forgive me!
={
QUERY(IMPORTHTML("en.wikipedia.org/wiki/List_of_cities_proper_by_population", "table",2),"select Col2, Col8",1)
,
This formula is too big to fit in one tweet, so I'm going to have to split it. Forgive me!
={
QUERY(IMPORTHTML("en.wikipedia.org/wiki/List_of_cities_proper_by_population", "table",2),"select Col2, Col8",1)
,
ARRAYFORMULA(IFERROR(SUBSTITUTE(REGEXEXTRACT(QUERY(IMPORTHTML("en.wikipedia.org/wiki/List_of_cities_proper_by_population","table",2),"select Col4",1),"โ ([0-9,]*)"),",","")*1,"Population"))
}
We're done! ๐คช Here's the output.
(You can delete the workings in column F.)
}
We're done! ๐คช Here's the output.
(You can delete the workings in column F.)
Yes, it would have been quicker to cut and paste the table from Wikipedia and fix the funny formats manually.
But where's the fun in that? ๐คฃ
Think of this as an exercise in combining some of the most useful single functions in Google Sheets to create really powerful formulas.
But where's the fun in that? ๐คฃ
Think of this as an exercise in combining some of the most useful single functions in Google Sheets to create really powerful formulas.
By the way, my free 30 Day Advanced Formulas course covers all of the functions/formulas we used today...
๐น Query function: days 14 and 15
๐น Array formulas: day 17
๐น Import function: day 19
๐น Regex function: day 20
You can take the course for free at courses.benlcollins.com/p/advanced30/
๐น Query function: days 14 and 15
๐น Array formulas: day 17
๐น Import function: day 19
๐น Regex function: day 20
You can take the course for free at courses.benlcollins.com/p/advanced30/