By apogne


2014-10-25 20:14:30 8 Comments

I want to scrape all the data of a page that uses infinite scroll. The following Python code works.

for i in range(100):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)

Every time I scroll to the bottom, I wait 5 seconds, which is generally enough for the page to finish loading the newly generated content. But this may not be time-efficient: the page may finish loading the new content in less than 5 seconds. How can I detect whether the page has finished loading the new content each time I scroll down? If I could detect this, I could scroll down again as soon as the loading finishes, which would be more time-efficient.
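One way to detect this, sketched below under my own assumptions (the helper name and its parameters are hypothetical, not part of selenium's API): record `document.body.scrollHeight` before scrolling, then poll until the height grows (new content arrived) or a timeout expires (no more content).

```python
import time

def scroll_until_loaded(driver, poll=0.5, timeout=10, max_scrolls=100):
    # Record the current page height, scroll to the bottom, then wait only
    # until the height grows (new content loaded) or `timeout` expires.
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        deadline = time.time() + timeout
        while time.time() < deadline:
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height != last_height:
                break  # new content loaded; scroll again immediately
            time.sleep(poll)
        else:
            return  # height never changed within `timeout`: assume we hit the bottom
        last_height = new_height
```

This avoids the fixed 5-second sleep: each iteration waits only as long as the page actually takes to grow, up to `timeout`.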


@ahmed abdelmalek 2018-10-27 15:44:53

Here is how I did it, using a rather simple loop:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Firefox()
browser.get("url")
searchTxt = None
while not searchTxt:
    try:
        searchTxt = browser.find_element_by_name('NAME OF ELEMENT')
        searchTxt.send_keys("USERNAME")
    except NoSuchElementException:
        continue

@seeiespi 2018-05-13 04:36:48

Have you tried driver.implicitly_wait()? It is like a setting for the driver, so you only call it once in the session, and it tells the driver to wait up to the given amount of time for each command to be able to execute.

driver = webdriver.Chrome()
driver.implicitly_wait(10)

So if you set a wait time of 10 seconds, it will execute the command as soon as possible, waiting up to 10 seconds before it gives up. I've used this in similar scroll-down scenarios, so I don't see why it wouldn't work in your case. Hope this is helpful :)

@kenorb 2015-05-21 23:09:40

Below are three methods:

readyState

Checking page readyState (not reliable):

def page_has_loaded(self):
    self.log.info("Checking if {} page is loaded.".format(self.driver.current_url))
    page_state = self.driver.execute_script('return document.readyState;')
    return page_state == 'complete'

The wait_for helper function is good, but unfortunately click_through_to_new_page is open to the race condition where we manage to execute the script in the old page, before the browser has started processing the click, and page_has_loaded just returns true straight away.
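As a concrete illustration of this approach (a minimal sketch with a hypothetical helper name, not the wait_for helper from the blog), polling readyState looks like this, and inherits the race condition just described:

```python
import time

def wait_for_ready_state(driver, timeout=10, poll=0.25):
    # Return True once document.readyState reports 'complete', False on timeout.
    # Caveat: called right after a click, the *old* page may still be the one
    # reporting 'complete', so this can return True prematurely.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if driver.execute_script("return document.readyState;") == "complete":
            return True
        time.sleep(poll)
    return False
```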

id

Comparing new page ids with the old one:

def page_has_loaded_id(self):
    self.log.info("Checking if {} page is loaded.".format(self.driver.current_url))
    try:
        # self.old_page must have been captured before triggering the navigation
        new_page = self.driver.find_element_by_tag_name('html')
        return new_page.id != self.old_page.id
    except NoSuchElementException:
        return False

It's possible that comparing ids is not as effective as waiting for stale reference exceptions.

staleness_of

Using staleness_of method:

@contextlib.contextmanager
def wait_for_page_load(self, timeout=10):
    self.log.debug("Waiting for page to load at {}.".format(self.driver.current_url))
    old_page = self.driver.find_element_by_tag_name('html')
    yield
    WebDriverWait(self.driver, timeout).until(staleness_of(old_page))

For more details, check Harry's blog.

@Arthur Hebert 2018-04-02 23:00:12

Why do you say that self.driver.execute_script('return document.readyState;') is not reliable? It seems to work perfectly for my use case, which is waiting for a static file to load in a new tab (which is opened via javascript in another tab instead of .get()).

@kenorb 2018-04-03 09:40:33

@ArthurHebert It can be unreliable due to a race condition; I've added the relevant citation.

@Zeinab Abbasimazar 2014-10-25 21:44:05

By default, the webdriver waits for a page to load when you use the .get() method.

Since you may be looking for some specific element, as @user227215 said, you should use WebDriverWait to wait for an element located on your page:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

browser = webdriver.Firefox()
browser.get("url")
delay = 3 # seconds
try:
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")

I have used it for checking alerts. You can use any of the other locator strategies to find your element.

EDIT 1:

I should mention that the webdriver waits for a page to load by default. It does not wait for loading inside frames or for AJAX requests. This means that when you use .get('url'), your browser waits until the page is completely loaded and then moves on to the next command in the code. But when you are posting an AJAX request, webdriver does not wait, and it's your responsibility to wait an appropriate amount of time for the page, or a part of the page, to load; that is why there is a module named expected_conditions.

@apogne 2014-10-25 22:16:25

What is "IdOfMyElement"? Is it something I should predict, like the ID of an element that will be newly loaded? For example, I want to crawl the following page: pinterest.com/cremedelacrumb/yum

@Zeinab Abbasimazar 2014-10-25 22:22:49

You should find an element on your page which you're sure always exists. "IdOfMyElement" refers to an element's ID in the page; if the element doesn't have an ID, you can use any other type of locator, like an XPath.

@apogne 2014-10-26 02:27:38

I think it should not be something that always exists. It should be something that is newly loaded once I scroll down. Am I right? For example, can you tell me what this element would be for the page I mentioned before?

@Zeinab Abbasimazar 2014-10-26 17:44:38

The link <a href="/" id="logo" class="logo" data-force-refresh="1" data-element-type="146">Pinterest</a> is such an element in the link you have provided. BTW, check out my edit.

@fragles 2015-09-11 09:29:09

I was getting "find_element() argument after * must be a sequence, not WebElement", so I changed it to WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, "IdOfMyElement"))); see the manual at selenium-python.readthedocs.org/en/latest/waits.html

@Michael Ohlrogge 2016-05-20 19:13:44

The comment by @fragles and the answer by David Cullen were what worked for me. Perhaps this accepted answer could be updated accordingly?

@David Cullen 2016-06-06 12:52:31

Passing browser.find_element_by_id('IdOfMyElement') causes a NoSuchElementException to be raised. The documentation says to pass a tuple that looks like this: (By.ID, 'IdOfMyElement'). See my answer

@Ben Wilson 2016-12-01 22:52:39

Hopefully this helps someone else out, because it wasn't clear to me initially: WebDriverWait actually returns a web element that you can then perform an action on (e.g. click()), read text out of, etc. I was under the mistaken impression that it just caused a wait, after which you still had to find the element. If you do a wait and then a find element afterward, selenium can error out because it tries to find the element while the old wait is still processing (hopefully that makes sense). Bottom line: you don't need to find the element after using WebDriverWait -- it already returns the element.
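To illustrate the point above, here is a minimal, hypothetical re-implementation of the contract that WebDriverWait.until follows (a sketch of the behavior, not selenium's actual code): it returns the first truthy value the condition produces, which for an element-locating condition is the element itself, so no second find is needed.

```python
import time

def until(condition, timeout=10, poll=0.25):
    # Repeatedly call `condition` and return its first truthy result.
    # For an element-locating condition, that result *is* the element,
    # so the caller can click() it or read its text directly.
    deadline = time.time() + timeout
    while time.time() < deadline:
        value = condition()
        if value:
            return value
        time.sleep(poll)
    raise TimeoutError("condition never became truthy")
```

The value returned by the real WebDriverWait.until is used the same way: `elem = WebDriverWait(driver, 10).until(...)` followed directly by `elem.click()`.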

@Petar Vasilev 2017-04-16 08:57:08

Does the webdriver wait for the images to be loaded before continuing with the rest of the script?

@Zeinab Abbasimazar 2017-04-16 09:36:29

@PetarVasilev, if you're referring to get method, you can read this answer.

@raffamaiden 2017-07-09 16:18:52

On a side note, instead of scrolling down 100 times, you can check whether the DOM has stopped changing (covering the case where the bottom of the page is AJAX lazy-loaded):

import time
import logging

def scrollDown(driver, value):
    driver.execute_script("window.scrollBy(0, {})".format(value))

# Scroll down the page until no new content appears
def scrollDownAllTheWay(driver):
    old_page = driver.page_source
    while True:
        logging.debug("Scrolling loop")
        for i in range(2):
            scrollDown(driver, 500)
            time.sleep(2)
        new_page = driver.page_source
        if new_page != old_page:
            old_page = new_page
        else:
            break
    return True

@Moondra 2018-02-22 23:51:01

This is useful. However what does the 500 represent? Is it big enough to get to the end of the page?

@raffamaiden 2018-02-26 11:06:52

It's the amount (in pixels) the page scrolls at each step; you should set it as high as possible. I found that this number was enough for me, since it scrolls the page toward the bottom until the AJAX elements are lazy-loaded, which in turn triggers another round of scrolling.

@Carl 2017-01-26 12:17:08

From selenium/webdriver/support/wait.py

from selenium.webdriver.support.wait import WebDriverWait

driver = ...
element = WebDriverWait(driver, 10).until(
    lambda x: x.find_element_by_id("someId"))

@Rao 2017-05-08 06:44:07

How about putting WebDriverWait in a while loop and catching the exceptions?

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

browser = webdriver.Firefox()
browser.get("url")
delay = 3 # seconds
while True:
    try:
        WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
        print("Page is ready!")
        break # exit the loop once the specific element is present
    except TimeoutException:
        print("Loading took too much time! Trying again...")

@Corey Goldberg 2018-11-10 20:16:41

You don't need the loop?

@David Cullen 2016-05-18 14:49:05

Trying to pass find_element_by_id to the constructor for presence_of_element_located (as shown in the accepted answer) caused NoSuchElementException to be raised. I had to use the syntax in fragles' comment:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('url')
timeout = 5
try:
    element_present = EC.presence_of_element_located((By.ID, 'element_id'))
    WebDriverWait(driver, timeout).until(element_present)
except TimeoutException:
    print("Timed out waiting for page to load")

This matches the example in the documentation. Here is a link to the documentation for By.

@Michael Ohlrogge 2016-05-20 19:11:38

Thank you! Yes, this was needed for me too. ID isn't the only attribute that can be used; to get the full list, use help(By). E.g. I used EC.presence_of_element_located((By.XPATH, "//*[@title='Check All Q1']"))

@J0ANMM 2016-10-14 07:21:37

That's the way it works for me as well! I wrote an additional answer expanding on the different locators that are available with the By object.

@Liquidgenius 2018-08-01 20:10:35

I've posted a followup question dealing with expectations where different pages may be loaded, and not always the same page: stackoverflow.com/questions/51641546/…

@J0ANMM 2016-10-14 07:19:32

As mentioned in the answer from David Cullen, I've always seen it recommended to use a line like the following:

element_present = EC.presence_of_element_located((By.ID, 'element_id'))
WebDriverWait(driver, timeout).until(element_present)

It was difficult for me to find in one place all the possible locators that can be used with the By syntax, so I thought it would be useful to provide the list here. According to Web Scraping with Python by Ryan Mitchell:

ID

Used in the example; finds elements by their HTML id attribute

CLASS_NAME

Used to find elements by their HTML class attribute. Why is this function called CLASS_NAME and not simply CLASS? Using the form object.CLASS would create problems for Selenium's Java library, where .class is a reserved method. To keep the Selenium syntax consistent between languages, CLASS_NAME was used instead.

CSS_SELECTOR

Find elements by their class, id, or tag name, using the #idName, .className, tagName convention.

LINK_TEXT

Finds HTML tags by the text they contain. For example, a link that says "Next" can be selected using (By.LINK_TEXT, "Next").

PARTIAL_LINK_TEXT

Similar to LINK_TEXT, but matches on a partial string.

NAME

Finds HTML tags by their name attribute. This is handy for HTML forms.

TAG_NAME

Finds HTML tags by their tag name.

XPATH

Uses an XPath expression ... to select matching elements.

@David Cullen 2016-10-14 15:07:15

The documentation for By lists the attributes which can be used as locators.

@J0ANMM 2016-10-14 16:05:15

That was what I had been looking for! Thanks! Well, now it should be easier to find as google was sending me to this question, but not to the official documentation.
