Python Web Scraping: tips and tricks

Use pandas read_html() to parse HTML tables (when possible)

Some pages you scrape may include old-school HTML <table> elements with tabular data. There’s an easy way to extract data from pages like this into Pandas dataframes (although you may need to clean it up afterward):

>>> import pandas as pd
>>> pd.read_html('https://en.wikipedia.org/wiki/Yakutsk')[0]
                                      Yakutsk  Якутск                                  Yakutsk  Якутск.1
0                 City under republic jurisdiction[1]                City under republic jurisdiction[1]
1                              Other transcription(s)                             Other transcription(s)
2                                              Yakut                                         Дьокуускай
3                        Central Yakutsk from the air                       Central Yakutsk from the air
4   .mw-parser-output .ib-settlement-cols{text-ali...  .mw-parser-output .ib-settlement-cols{text-ali...
5                                 Location of Yakutsk                                Location of Yakutsk
6   .mw-parser-output .locmap .od{position:absolut...  .mw-parser-output .locmap .od{position:absolut...
7   Coordinates: .mw-parser-output .geo-default,.m...  Coordinates: .mw-parser-output .geo-default,.m...
8                                             Country                                             Russia
9                                     Federal subject                                  Sakha Republic[2]
10                                            Founded                                               1632
11                                  City status since                                               1643
12                                         Government                                         Government
13                                              Body                                      Okrug Council
14                                              Head                                   Evgeny Grigoriev
15                                               Area                                               Area
16                                             Total                                 122 km2 (47 sq mi)
17                                          Elevation                                      95 m (312 ft)
18                                         Population                                         Population
19                                Estimate (2018)[3]                                             311760
20                                              Rank                                       68th in 2010
21                              Administrative status                              Administrative status
22                                   Subordinated to        city of republic significance of Yakutsk[1]
23                                        Capital of                                  Sakha Republic[2]
24                                        Capital of        city of republic significance of Yakutsk[1]
25                                   Municipal status                                   Municipal status
26                                       Urban okrug                             Yakutsk Urban Okrug[4]
27                                        Capital of                             Yakutsk Urban Okrug[4]
28                                          Time zone                                  UTC+9 (MSK+6 [5])
29                                  Postal code(s)[6]                                             677xxx
30                                    Dialing code(s)                                         +7 4112[7]
31                                           OKTMO ID                                        98701000001
32                                           City Day                         Second Sunday of September
33                                            Website                                          якутск.рф
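The cleanup step mentioned above could look roughly like the sketch below. It is only an illustration: the inline HTML table, the column names and the load_infobox() helper are made up for this example, and read_html() needs an HTML parser such as lxml to be installed.

```python
from io import StringIO

import pandas as pd

# Made-up miniature table standing in for a scraped page (an assumption for this sketch).
HTML = """
<table>
  <tr><th>Field</th><th>Value</th></tr>
  <tr><td>Founded</td><td>1632</td></tr>
  <tr><td>Elevation</td><td>95 m (312 ft)</td></tr>
  <tr><td>Population[3]</td><td>311760</td></tr>
</table>
"""


def load_infobox(html):
    """Parse the first table in the HTML and return a Field -> Value Series."""
    # read_html() returns a list with one DataFrame per <table> it finds.
    table = pd.read_html(StringIO(html))[0]
    # Strip footnote markers such as "[3]" from the field names.
    table["Field"] = table["Field"].str.replace(r"\[\d+\]", "", regex=True)
    return table.set_index("Field")["Value"]


facts = load_infobox(HTML)
```

After this, facts["Population"] looks up a single value by its cleaned-up field name.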

Look for JSON snippets inside HTML and extract them

Some sites have pages with JSON snippets that are meant to be parsed by search engines. They provide pre-structured data and can save you the hassle of getting your XPath queries or CSS selectors right.

See the schema.org specification for examples of what you might find on sites that want to make search engine crawling easier.
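Such snippets are commonly embedded as JSON-LD inside <script type="application/ld+json"> elements. Here is a minimal sketch of pulling them out using only the standard library; the sample HTML, the JsonLdExtractor class and the extract_json_ld() helper are hypothetical names invented for this example.

```python
import json
from html.parser import HTMLParser

# Made-up page fragment for illustration; real pages will differ.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Widget",
 "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"}}
</script>
</head><body><h1>Example Widget</h1></body></html>
"""


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of every JSON-LD <script> element."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.items.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # skip snippets that are not valid JSON


def extract_json_ld(html):
    """Return a list of decoded JSON-LD objects found in the HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


items = extract_json_ld(SAMPLE_HTML)
```

Each entry in items is a plain Python dict, so items[0]["offers"]["price"] can be read without writing any selectors.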

Use js2xml to extract data from JavaScript snippets

On the other hand, some sites will give you pages with JavaScript snippets containing the data you want to scrape. Extracting it with regular expressions or basic string processing is tedious and fragile. Luckily, there is a Python module called js2xml developed by Zyte, the creators of the Scrapy framework. This module converts JS code into XML trees that you can run XPath queries against.

The official README file provides the following example:

>>> import js2xml
>>>
>>> jscode = """function factorial(n) {
...     if (n === 0) {
...         return 1;
...     }
...     return n * factorial(n - 1);
... }"""
>>> parsed = js2xml.parse(jscode)
>>>
>>> parsed.xpath("//funcdecl/@name")  # extracts function name
['factorial']
>>>
>>> print(js2xml.pretty_print(parsed))  # pretty-print generated XML
<program>
  <funcdecl name="factorial">
    <parameters>
      <identifier name="n"/>
    </parameters>
    <body>
      <if>
        <predicate>
          <binaryoperation operation="===">
            <left>
              <identifier name="n"/>
            </left>
            <right>
              <number value="0"/>
            </right>
          </binaryoperation>
        </predicate>
        <then>
          <block>
            <return>
              <number value="1"/>
            </return>
          </block>
        </then>
      </if>
      <return>
        <binaryoperation operation="*">
          <left>
            <identifier name="n"/>
          </left>
          <right>
            <functioncall>
              <function>
                <identifier name="factorial"/>
              </function>
              <arguments>
                <binaryoperation operation="-">
                  <left>
                    <identifier name="n"/>
                  </left>
                  <right>
                    <number value="1"/>
                  </right>
                </binaryoperation>
              </arguments>
            </functioncall>
          </right>
        </binaryoperation>
      </return>
    </body>
  </funcdecl>
</program>

>>>

Launch lynx --dump to quickly extract text

Lynx is a text-only web browser dating back to the early days of the World Wide Web. It has a CLI option to quickly extract text from simple web pages:

$ lynx --dump https://quotes.toscrape.com/
[1]Quotes to Scrape

   [2]Login
   "The world as we have created it is a process of our thinking. It
   cannot be changed without changing our thinking." by Albert Einstein
   [3](about)
   Tags:
   [4]change [5]deep-thoughts [6]thinking [7]world
   "It is our choices, Harry, that show what we truly are, far more than
   our abilities." by J.K. Rowling [8](about)
   Tags:
   [9]abilities [10]choices
   "There are only two ways to live your life. One is as though nothing is
   a miracle. The other is as though everything is a miracle." by Albert
   Einstein [11](about)
   Tags:
   [12]inspirational [13]life [14]live [15]miracle [16]miracles

   "The person, be it gentleman or lady, who has not pleasure in a good
   novel, must be intolerably stupid." by Jane Austen [17](about)
   Tags:
   [18]aliteracy [19]books [20]classic [21]humor

   "Imperfection is beauty, madness is genius and it's better to be
   absolutely ridiculous than absolutely boring." by Marilyn Monroe
   [22](about)
   Tags:
   [23]be-yourself [24]inspirational

   "Try not to become a man of success. Rather become a man of value." by
   Albert Einstein [25](about)
   Tags:
   [26]adulthood [27]success [28]value

   "It is better to be hated for what you are than to be loved for what
   you are not." by André Gide [29](about)
   Tags:
   [30]life [31]love

   "I have not failed. I've just found 10,000 ways that won't work." by
   Thomas A. Edison [32](about)
   Tags:
   [33]edison [34]failure [35]inspirational [36]paraphrased

   "A woman is like a tea bag; you never know how strong it is until it's
   in hot water." by Eleanor Roosevelt [37](about)
   Tags:
   [38]misattributed-eleanor-roosevelt

   "A day without sunshine is like, you know, night." by Steve Martin
   [39](about)
   Tags:
   [40]humor [41]obvious [42]simile

     * [43]Next ->

Top Ten tags

   [44]love [45]inspirational [46]life [47]humor [48]books [49]reading
   [50]friendship [51]friends [52]truth [53]simile

   Quotes by: [54]GoodReads.com

   Made with ? by [55]Scrapinghub

References

   1. https://quotes.toscrape.com/
   2. https://quotes.toscrape.com/login
   3. https://quotes.toscrape.com/author/Albert-Einstein
   4. https://quotes.toscrape.com/tag/change/page/1/
   5. https://quotes.toscrape.com/tag/deep-thoughts/page/1/
   6. https://quotes.toscrape.com/tag/thinking/page/1/
   7. https://quotes.toscrape.com/tag/world/page/1/
   8. https://quotes.toscrape.com/author/J-K-Rowling
   9. https://quotes.toscrape.com/tag/abilities/page/1/
  10. https://quotes.toscrape.com/tag/choices/page/1/
  11. https://quotes.toscrape.com/author/Albert-Einstein
  12. https://quotes.toscrape.com/tag/inspirational/page/1/
  13. https://quotes.toscrape.com/tag/life/page/1/
  14. https://quotes.toscrape.com/tag/live/page/1/
  15. https://quotes.toscrape.com/tag/miracle/page/1/
  16. https://quotes.toscrape.com/tag/miracles/page/1/
  17. https://quotes.toscrape.com/author/Jane-Austen
  18. https://quotes.toscrape.com/tag/aliteracy/page/1/
  19. https://quotes.toscrape.com/tag/books/page/1/
  20. https://quotes.toscrape.com/tag/classic/page/1/
  21. https://quotes.toscrape.com/tag/humor/page/1/
  22. https://quotes.toscrape.com/author/Marilyn-Monroe
  23. https://quotes.toscrape.com/tag/be-yourself/page/1/
  24. https://quotes.toscrape.com/tag/inspirational/page/1/
  25. https://quotes.toscrape.com/author/Albert-Einstein
  26. https://quotes.toscrape.com/tag/adulthood/page/1/
  27. https://quotes.toscrape.com/tag/success/page/1/
  28. https://quotes.toscrape.com/tag/value/page/1/
  29. https://quotes.toscrape.com/author/Andre-Gide
  30. https://quotes.toscrape.com/tag/life/page/1/
  31. https://quotes.toscrape.com/tag/love/page/1/
  32. https://quotes.toscrape.com/author/Thomas-A-Edison
  33. https://quotes.toscrape.com/tag/edison/page/1/
  34. https://quotes.toscrape.com/tag/failure/page/1/
  35. https://quotes.toscrape.com/tag/inspirational/page/1/
  36. https://quotes.toscrape.com/tag/paraphrased/page/1/
  37. https://quotes.toscrape.com/author/Eleanor-Roosevelt
  38. https://quotes.toscrape.com/tag/misattributed-eleanor-roosevelt/page/1/
  39. https://quotes.toscrape.com/author/Steve-Martin
  40. https://quotes.toscrape.com/tag/humor/page/1/
  41. https://quotes.toscrape.com/tag/obvious/page/1/
  42. https://quotes.toscrape.com/tag/simile/page/1/
  43. https://quotes.toscrape.com/page/2/
  44. https://quotes.toscrape.com/tag/love/
  45. https://quotes.toscrape.com/tag/inspirational/
  46. https://quotes.toscrape.com/tag/life/
  47. https://quotes.toscrape.com/tag/humor/
  48. https://quotes.toscrape.com/tag/books/
  49. https://quotes.toscrape.com/tag/reading/
  50. https://quotes.toscrape.com/tag/friendship/
  51. https://quotes.toscrape.com/tag/friends/
  52. https://quotes.toscrape.com/tag/truth/
  53. https://quotes.toscrape.com/tag/simile/
  54. https://www.goodreads.com/quotes
  55. https://scrapinghub.com/

If you are doing some Natural Language Processing and do not want to spend too much time on web scraping, calling a command like this through the Python subprocess module can be a helpful shortcut.
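A minimal sketch of such a wrapper follows. The function names lynx_command() and dump_text() and the 30-second default timeout are assumptions made for this example, and it presumes the lynx binary is installed and on your PATH.

```python
import subprocess


def lynx_command(url):
    """Build the argument list for the same command as shown above."""
    return ["lynx", "--dump", url]


def dump_text(url, timeout=30):
    """Run lynx and return the rendered page text as a string.

    Raises subprocess.CalledProcessError if lynx exits with an error and
    subprocess.TimeoutExpired if it takes longer than `timeout` seconds.
    """
    result = subprocess.run(
        lynx_command(url),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode output to str using the locale encoding
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

Calling dump_text("https://quotes.toscrape.com/") should then return roughly the text shown in the dump above. Passing the command as a list, rather than as a single shell string, avoids shell quoting problems with unusual URLs.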

Use multiprocessing.dummy for easy multithreading of highly parallel tasks

To spread a workload across a pool of child worker processes, Python provides the multiprocessing module. However, web scraping scripts are typically I/O-bound rather than CPU-bound, so it makes more sense to distribute the workload across threads, not processes. Luckily, Python also provides the multiprocessing.dummy module, which has the same API as multiprocessing but creates and uses a pool of worker threads instead. For example, one could initialize a thread pool object and call its map() method with two arguments: 1) a function that scrapes a single page and 2) a list of page URLs. The thread pool will run the requests concurrently and return a list of results in the same order as the URLs. If you would rather not wait for all results to arrive at once, the imap() method of ThreadPool yields them one at a time, in input order, as they become ready.
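The approach described above might be sketched as follows. The scrape_page() function is a hypothetical stand-in for your own per-page scraper, and the helper names and the default of four workers are arbitrary choices for this example.

```python
from multiprocessing.dummy import Pool as ThreadPool
from urllib.request import urlopen


def scrape_page(url):
    """Hypothetical per-page scraper: download the page and report its size."""
    with urlopen(url, timeout=10) as response:
        return url, len(response.read())


def scrape_all(urls, scrape_fn=scrape_page, workers=4):
    """Scrape every URL concurrently and return all results as one list."""
    with ThreadPool(workers) as pool:
        # map() blocks until every page is done; results keep the input order.
        return pool.map(scrape_fn, urls)


def scrape_incrementally(urls, scrape_fn=scrape_page, workers=4):
    """Yield results one by one, in input order, as soon as each is ready."""
    with ThreadPool(workers) as pool:
        for result in pool.imap(scrape_fn, urls):
            yield result
```

For instance, "for url, size in scrape_incrementally(page_urls): print(url, size)" starts printing before the whole batch has finished. If the order of results does not matter, the pool's imap_unordered() method yields each result as soon as it completes.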

Trickster Dev

Code level discussion of web scraping, gray hat automation, growth hacking and bounty hunting


By rl1987, 2022-01-04