Master the ImportXML function in Google Sheets and you'll feel like a certified Sheets wizard. ImportXML pulls information from any XML field on a web page, so you can import the data and metadata of almost any page straight into a spreadsheet.
How to use the ImportXML function of Google Sheets
XML is the markup language that structures the data in a web page. In essence, any pair of <something> and </something> tags is a building block of the page source, and a particular piece of data sits between them. A page's source will have some text in a <p> tag (a paragraph), sometimes containing a <b> tag (bold text) and possibly an <a> tag (a link), followed by the closing tags </a>, </b>, </p>, and </body>.
The Google Sheets ImportXML function can find a particular XML element and copy the data out of it. In the example above, if we want every link on the page, we ask ImportXML to import everything inside the <a> </a> tags. If you want the whole text of a page, you can start by taking everything in <body> </body>, or every instance of <p> </p>, and then clean up the data afterwards.
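To make the idea of "taking what sits between the tags" concrete outside of Sheets, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The page source, URLs, and variable names are invented for illustration, and the snippet is deliberately well-formed XML; real web pages are often messier than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical page source, invented for illustration (not a real page).
PAGE = """
<html>
  <body>
    <p>Some text with <b>bold words</b> and <a href="https://example.com/one">a link</a>.</p>
    <p>Another paragraph with <a href="https://example.com/two">a second link</a>.</p>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Everything inside <a> </a>: the link text plus its href attribute.
links = [(a.text, a.get("href")) for a in root.iter("a")]

# The full text of each <p>, including text nested inside <b> and <a>.
paragraphs = ["".join(p.itertext()) for p in root.iter("p")]

print(links)
print(paragraphs)
```

Collecting the links gives a short, targeted list, while collecting the paragraphs gives all of the text and leaves the cleanup for later, which mirrors the two approaches described above.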
How to extract a list of postcodes and neighborhoods in a city
Wikipedia tables make great ImportXML exercises. This article uses the example of importing every postcode in Edmonton, Alberta. Find the Wikipedia list of Canadian postal codes starting with the letter T, and open that page in a new browser window to get started.
Select a postcode, right-click it, and choose Inspect to open the browser tool that shows the page source. You will see that each postcode sits in a <td> tag (which identifies a cell in the table). We will then import the contents of every <td> tag that contains Edmonton.
Create a new blank Google Sheet. We will take the contents of the <td> tags, including the <span> and links inside them, by specifying the data we want in XPath syntax. ImportXML takes two arguments: the URL of the page and the XPath query for the tag you are looking for.
Going back to the page source, we see the postal code in bold inside <b> </b> tags and the city name, linked to its Wikipedia article, inside <a> </a> tags. Now let's get only the main city link in each cell and leave out the other links (the neighborhoods). Enter these two formulas in columns A and B:
=IMPORTXML("https://en.wikipedia.org/wiki/List_of_T_postal_codes_of_Canada", "//td/span/a[1]")
=IMPORTXML("https://en.wikipedia.org/wiki/List_of_T_postal_codes_of_Canada", "//td/b[1]")
You need to refine the results a bit:
This exercise shows how the XPath query syntax works: tag[1] returns only the first instance of <tag> inside its <parent tag>. So td/span/a[1] gives you the first link in the <span> of each <td>. Similarly, td/b[1] gives you the first bold text in each <td>, which in this case is just the postal code.
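The effect of the [1] position predicate can be tried out in Python, whose xml.etree.ElementTree module supports a limited subset of XPath. This is only a sketch: the table below is a made-up stand-in for the Wikipedia markup, with placeholder postcodes and neighborhood names, and the paths start with ".//" because ElementTree queries are relative to an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical table loosely modelled on the layout described above.
# Postcodes and neighborhood names are placeholders, not real data.
TABLE = """
<table>
  <tr>
    <td><b>T1X</b><span><a>Edmonton</a> (<a>Area One</a> / <a>Area Two</a>)</span></td>
    <td><b>T2X</b><span><a>Calgary</a> (<a>Area Three</a> / <a>Area Four</a>)</span></td>
  </tr>
</table>
"""

root = ET.fromstring(TABLE)

# Like //td/span/a[1]: the first <a> in the <span> of each <td>.
cities = [a.text for a in root.findall(".//td/span/a[1]")]

# Like //td/b[1]: the first <b> in each <td>.
codes = [b.text for b in root.findall(".//td/b[1]")]

# Without [1], every link in the span comes back, neighborhoods included.
all_links = [a.text for a in root.findall(".//td/span/a")]

print(cities)
print(codes)
print(all_links)
```

Dropping the [1] triples the number of links returned in this sample, which is why the predicate matters when only the city name is wanted.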
The great thing is that you can run two queries in one function. Combine the two requests with a | symbol between them:
=IMPORTXML("https://en.wikipedia.org/wiki/List_of_T_postal_codes_of_Canada", "//td/span/a[1] | //td/b[1]")
However, you won't get the same results as before. The combined query interleaves the matches from both requests in one long list instead of returning two columns. That is useful in many cases, but it isn't what we need in this article.
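Why the combined query alternates its results can be sketched in Python. ElementTree does not support the | operator, so the snippet below emulates a union by walking the tree in document order and keeping elements matched by either query; this emulation, the sample table, and the placeholder postcodes are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical table with placeholder postcodes (not real data).
TABLE = """
<table>
  <tr>
    <td><b>T1X</b><span><a>Edmonton</a> (<a>Area One</a>)</span></td>
    <td><b>T2X</b><span><a>Calgary</a> (<a>Area Two</a>)</span></td>
  </tr>
</table>
"""

root = ET.fromstring(TABLE)

# Collect the elements matched by each of the two queries.
wanted = {id(el) for el in root.findall(".//td/span/a[1]")}
wanted |= {id(el) for el in root.findall(".//td/b[1]")}

# root.iter() visits elements in document order, so matches from the
# two queries come out interleaved rather than grouped per query.
combined = [el.text for el in root.iter() if id(el) in wanted]

print(combined)
```

Because each cell holds its postcode followed by its city link, the single list alternates between the two kinds of value.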
To select only the postcodes in the cells that contain the 'Edmonton' link, we will use this formula:
=IMPORTXML("https://en.wikipedia.org/wiki/List_of_T_postal_codes_of_Canada", "//td[span/a='Edmonton']/b[1]")
Put the search condition, the text that narrows down the results, in square brackets. It filters which cells are matched without changing what is returned from them.
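The filtering role of the square brackets can be spelled out as a plain loop. ElementTree's XPath subset cannot express a nested path such as span/a inside a predicate, so this Python sketch applies the condition by hand; the table and its placeholder postcodes are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical table with placeholder postcodes (not real data).
TABLE = """
<table>
  <tr>
    <td><b>T1X</b><span><a>Edmonton</a> (<a>Area One</a>)</span></td>
    <td><b>T2X</b><span><a>Calgary</a> (<a>Area Two</a>)</span></td>
    <td><b>T3X</b><span><a>Edmonton</a> (<a>Area Three</a>)</span></td>
  </tr>
</table>
"""

root = ET.fromstring(TABLE)

edmonton_codes = []
for td in root.iter("td"):
    # The condition in square brackets: does any span/a read 'Edmonton'?
    link_texts = [a.text for a in td.findall("span/a")]
    if "Edmonton" in link_texts:
        # What is returned stays the same as before: the first <b> of the cell.
        first_bold = td.find("b")
        if first_bold is not None:
            edmonton_codes.append(first_bold.text)

print(edmonton_codes)
```

The condition decides which cells qualify, and the rest of the path decides what is taken from each qualifying cell.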
Now for the names of the neighborhoods. Write the corresponding ImportXML function in the next column to get the text that follows "Edmonton".
We take the entire contents of span[1] and use the parentheses and slashes as delimiters to split the content, putting "Edmonton" in the first column and the neighborhood names in the following columns. We can then match each postcode with its corresponding names:
Next, use the SPLIT function together with CONCATENATE over the adjacent columns to split and regroup the data:
=SPLIT(CONCATENATE(B2:J2), "(/)")
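By default, SPLIT treats each character of the delimiter string as a separate delimiter and drops empty pieces, so "(/)" splits on "(", "/", and ")". The Python sketch below mimics that behavior on a made-up cell value; the helper name split_on_each is hypothetical, and trimming the surrounding spaces is an extra step added here for readability, not something SPLIT does by itself.

```python
import re

def split_on_each(text, delimiters):
    """Split text on every single character found in delimiters."""
    # Build a character class such as [\(/\)] from the delimiter string.
    pattern = "[" + re.escape(delimiters) + "]"
    parts = re.split(pattern, text)
    # Drop empty pieces; strip() is an added convenience (an assumption).
    return [part.strip() for part in parts if part.strip()]

# Hypothetical concatenated cell contents with placeholder names.
cell = "Edmonton(Area One / Area Two)"
columns = split_on_each(cell, "(/)")

print(columns)
```

The city lands in the first position and each neighborhood in the positions after it, matching the column layout described above.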
Finally, here is the results table with the necessary information:
How to automatically copy email addresses from the web
This section shows how to get every employee's email address from Zapier's About page. Looking at the source code, you will see that each member's email address is in a <span> with class="email". When you want to select by a tag attribute, use the Google Sheets ImportXML function as follows:
=IMPORTXML("https://zapier.com/about//", "//span[@class='email']")
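Selecting by attribute works the same way in Python's ElementTree, which supports the [@attribute='value'] predicate. The markup below is a hypothetical stand-in for a team page, with invented names and example.com addresses; it is not the actual source of the Zapier page.

```python
import xml.etree.ElementTree as ET

# Hypothetical team-page markup with invented names and addresses.
PAGE = """
<div>
  <div class="member">
    <span class="name">Ada</span>
    <span class="email">ada@example.com</span>
  </div>
  <div class="member">
    <span class="name">Grace</span>
    <span class="email">grace@example.com</span>
  </div>
</div>
"""

root = ET.fromstring(PAGE)

# Like //span[@class='email']: only spans whose class attribute is "email".
emails = [span.text for span in root.findall(".//span[@class='email']")]

print(emails)
```

The spans holding names are skipped because their class attribute does not match.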
How to use Regex to import email addresses from the web in Google Sheets
To get the Zapier addresses using the power of regex, we'll import every <span> instead of looking for the class. We'll do this in two steps: pull the information from the Zapier page into the first column, then extract the emails into the second column:
Remember, ImportXML fills in as many columns and rows as it needs, depending on the data it finds, whereas the regex formula must be entered in every cell where you want a result. To put it all together in one step, use the REGEXEXTRACT function inside an array formula:
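As a rough illustration of the per-cell extraction step, here is a Python sketch that applies one regular expression to a column of imported text. The email pattern is a simplified assumption rather than the pattern from the original formula, the cell values are invented, and returning an empty string for cells without a match is a choice made here (REGEXEXTRACT reports an error for non-matching cells instead).

```python
import re

# Simplified email pattern (an assumption, not the article's regex).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def extract_email(cell):
    """Return the first email address found in cell, or "" if none."""
    match = EMAIL_RE.search(cell)
    return match.group(0) if match else ""

# Hypothetical first column: the text of every imported <span>.
cells = [
    "Ada Lovelace",
    "ada@example.com",
    "Writes to grace@example.com daily",
    "",
]

# Applying the function to the whole column at once plays the role
# that an array formula plays in the spreadsheet.
extracted = [extract_email(cell) for cell in cells]

print(extracted)
```

Running the extraction over the whole list in one expression avoids repeating the regex step cell by cell.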