Contents
Introduction
Getting started
General settings
Creating a New Project
How can you download a whole website
or any part of it onto a disk?
Like most people, you have probably faced
this problem at one time or another. Browsers such as Internet Explorer or
Netscape Navigator were designed to save one page at a time. If a site
consists of 1000 pages, you would have to click your mouse 1000 times and
choose a directory 1000 times to save every file.
Now another option is available: the new version of the Website Extractor
program. All you have to do is enter the address of the website; the program
takes care of the downloading. Then you just wait a short time while the
program copies all or part of the website you have requested.
The Website Extractor program is conveniently
designed to download Internet websites exactly the way you want them, including
or excluding any parts you need or don't need (such as directory, domain
and file names, types of files, their size or any other properties).
The Extractor can download up to 100 files at
a time, which saves you a huge amount of time compared to ordinary browsers.
All data retrieved are stored in the directory you select and contain only
the files and directories matching your filter instructions.
A broad range of customized settings for downloading
web pages will enable you to limit the scope of files retrieved to such
types as jpg or html files.
Website Extractor automatically retries
any files that were not copied due to transfer errors or bad connections.
The program can run through a proxy server and can download only revised
or new files, skipping documents that have already been copied.
The Extractor is essentially a search robot
designed to move quickly through hyperlinks, downloading web pages at the
user's request. It offers numerous settings and options to facilitate this
task. You can limit your search by domain type (such as com, net, uk, etc.)
and apply filters based on a list of keywords, along with other auxiliary
options.
To download the Website Extractor program go to:
http://www.asona.org/
How the program works
Main menu
After you launch the program the main menu will
appear on your screen.
The main menu consists of the following options:
- New - to start a new search
- Open - to open a search in progress
- Reopen - to quickly open one of the eight searches recently initiated
- Save - to save a search project that has been initiated
- Save as - to save a search project under a new name
- Default options - project options that run by default
- Exit - to exit the program
General settings
Before running the program it is advisable to
adjust the general settings.
To do this, launch the program and choose Default
Options.
Download files
Follow new links / URL
Copy subdirectory structure from website
Extract local link
Stay within initial domain list.
Links level limit
Number of connections
Save results automatically
Time out for one connection
Number of retries
Swap URL count
Does not visit twice already scanned site
Apply domainname.com = www.domainname.com
Expand the nodes parents to make the node visible
Identify browser as
File Type Filter: Limiting the types and sizes
of files
URL / Domain Filter
Domains: Limitations by domain type
The first thing to do is decide which directory (new
path) you will use to save project files and the path to the directory
for saving files copied (downloaded) from the Internet.
Download files - this option is used to
download files onto your hard drive.
If this option is not selected, the program
will only save a list of the scanned hyperlinks to a special file.
Enter the proxy server properties (if you use a proxy server).
Then choose any other options you would like to use in downloading and
searching for hyperlinks.
Let's take a look at the various options available.
Follow new links / URL - follows hyperlinks
automatically. This option lets the program automatically extract other
websites linked to the one you are scanning.
Copy subdirectory structure from website
- copies the subdirectory structure of the website you wish to download.
If this option is selected, the program creates directories on your hard
drive that mirror the ones on the website you are downloading.
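The idea of mirroring a site's directories can be sketched in a few lines of Python. This is only an illustration, not the Extractor's own code: the function name `local_path_for`, the `downloads` root folder and the choice of `index.html` for directory URLs are all assumptions made for the example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def local_path_for(url: str, root: str = "downloads") -> str:
    """Map a URL to a local path that mirrors the site's directories.

    Hypothetical helper: the root folder and the index.html default
    are assumptions, not documented program behaviour.
    """
    parts = urlsplit(url)
    path = parts.path
    # A URL with no path, or one ending in "/", names a directory;
    # store its page under an assumed default file name.
    if path == "" or path.endswith("/"):
        path += "index.html"
    # Strip the leading "/" so the path nests under root/host.
    return str(PurePosixPath(root) / parts.netloc.lower() / path.lstrip("/"))


print(local_path_for("http://www.Internet-soft.com/DEMO/page.htm"))
```

With the option switched off, a comparable tool would presumably place all files in a single directory instead of nesting them by host and subdirectory.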
Extract local link - to search for local
hyperlinks. This option allows you to search for local links on the website
you are scanning, i.e. links that refer to other documents on the website.
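As a rough illustration of what searching for local links involves, the sketch below collects the links on a page and keeps only those that point to the same host. It uses Python's standard `html.parser`; the class `LinkCollector`, the function `local_links` and the example host are hypothetical, and the Extractor may well define "local" differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, href))


def local_links(html: str, base_url: str) -> list:
    """Return only the links whose host matches the page's own host."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    base_host = urlsplit(base_url).hostname
    return [u for u in collector.links if urlsplit(u).hostname == base_host]


page = '<a href="docs/a.htm">A</a> <a href="http://vista.ru/">V</a>'
print(local_links(page, "http://www.example.com/index.htm"))
```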
Stay within initial domain list - A very
convenient option that lets the program extract (but not download) hyperlinks
to websites not included in the original list of addresses. Here you
decide whether you need to download other websites referred to from the one
you are downloading. With this option selected, you will only download files
from the sites you listed. Without it, the sites linked to the one you are
scanning will also be downloaded.
For example, suppose you only need to download a list
of (URL) addresses
http://www.Internet-soft.com/DEMO
http://www.Esalesbiz.com/extra
…
and you don't need to download other domains
linked to the original list of domains (e.g. http://vista.ru).
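The check behind this option can be pictured as a comparison of host names. The following sketch is an assumption about how such a test might look, not the program's actual logic; `host_of` and `within_initial_domains` are hypothetical names.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lower-cased host name of a URL ("" if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def within_initial_domains(url: str, start_urls: list) -> bool:
    """True if the URL's host appears among the hosts of the start list."""
    allowed = {host_of(u) for u in start_urls}
    return host_of(url) in allowed


start = ["http://www.Internet-soft.com/DEMO", "http://www.Esalesbiz.com/extra"]
print(within_initial_domains("http://www.internet-soft.com/other/page.htm", start))
print(within_initial_domains("http://vista.ru", start))
```

In this sketch a link to vista.ru would still be recorded as a found hyperlink, but it would not be queued for download.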
Links level limit - the number of download
levels, i.e. the maximum number of hyperlink steps the program will follow.
An example will help to illustrate this option.
Let's assume there is a hyperlink from one site to another, a link from
the second site to a third, and so on. As you can see, a number of
hyperlinks must be followed to get from the first site to the last. This
option sets the greatest number of hyperlink steps allowed. Each step can
lead to a number of other websites. So if you have selected only one level,
you will only copy the websites (let's call them X1 websites) that are
linked from the website you are downloading (scanning), and not the sites
linked from the X1 websites.
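The effect of a level limit can be sketched as a breadth-first walk that stops following links once the limit is reached. The link graph below is a made-up dictionary standing in for real pages, and `crawl_levels` is a hypothetical name; how the Extractor itself orders its work is not stated here.

```python
from collections import deque


def crawl_levels(start: str, links: dict, max_level: int) -> dict:
    """Return {page: level} for pages reachable within max_level steps.

    `links` maps each page to the pages it links to (toy data).
    """
    levels = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if levels[page] == max_level:
            continue  # do not follow links beyond the limit
        for target in links.get(page, []):
            if target not in levels:  # skip pages already scanned
                levels[target] = levels[page] + 1
                queue.append(target)
    return levels


# Toy graph: the start site links to two X1 sites, one of which links on.
toy_links = {"start": ["X1a", "X1b"], "X1a": ["X2"], "X2": ["X3"]}
print(crawl_levels("start", toy_links, max_level=1))
```

With a limit of 1 only the X1 sites are reached; raising the limit to 2 would add X2, and so on.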
The following chart shows how the links level
limit works.
Number of connections - Here you
enter the number of simultaneous connections.
As a rule, 3 to 10 connections are used. The optimal
number depends on the number of lines you have and the
connection speed of your provider.
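A cap on simultaneous connections is commonly implemented with a worker pool. The sketch below assumes a caller-supplied `fetch` function (hypothetical) and uses Python's `ThreadPoolExecutor`; it only illustrates the concept and says nothing about how the Extractor manages its connections internally.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls: list, fetch, connections: int = 5) -> dict:
    """Fetch every URL using at most `connections` workers at once.

    `fetch` is any callable taking a URL and returning its content.
    """
    with ThreadPoolExecutor(max_workers=connections) as pool:
        # pool.map returns results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))


# Stand-in fetch function so the example runs without a network.
results = download_all(["a", "b", "c"], lambda u: u.upper(), connections=2)
print(results)
```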
Save results automatically - saves your
results automatically every N minutes.
This option sets how frequently your interim
search results are saved.
Time out for one connection - This option
sets the maximum time, in seconds, allowed for downloading each document
(one connection).
At the end of this time the program starts downloading
the next document.
Number of retries - The number of attempts
made to download each document.
If the provider connection or the website link is broken off, the program
will try to download the same file again, up to the number of attempts
you specify.
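The time-out and retry settings work together, which a short sketch can make concrete. The function below is a minimal illustration built on Python's `urllib.request.urlopen`; the name `fetch_with_retries`, its default values and the injectable `opener` parameter are assumptions for the example, not part of the Extractor.

```python
import urllib.request


def fetch_with_retries(url, retries=3, timeout=30.0,
                       opener=urllib.request.urlopen):
    """Try to download `url` up to `retries` times.

    Each attempt is limited to `timeout` seconds. `opener` can be
    swapped for a stand-in when testing without a network.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for _attempt in range(retries):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except OSError as exc:
            # URLError and socket time-outs are both OSError subclasses.
            last_error = exc
    # Every attempt failed: report the last error to the caller.
    raise last_error
```

A caller that catches the final error can move on to the next document, which matches the behaviour described for an expired time-out.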
Swap URL count - The number of addresses
added to the list of tasks (tree of downloadable addresses).
Does not visit twice already scanned site
- This option lets the program skip addresses that have already been
scanned.
Apply domainname.com = www.domainname.com
On some sites the hyperlinks to other sites omit
the www prefix, so the same documents may be saved twice in different
directories. This option is designed to deal with this anomaly. If you
select this option, INTERNET-SOFT.COM and WWW.INTERNET-SOFT.COM will be
treated as the same address. The www prefix is added to the address
automatically in this type of search.
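Treating the two spellings as one address amounts to normalizing the host name before comparing or saving. The sketch below is one possible way to do that in Python; `canonical_url` is a hypothetical helper, and it naively adds `www.` to every host that lacks it, which would be wrong for genuine subdomains.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Lower-case the host and add a www prefix if it is missing.

    Simplification: any user name or password in the URL is dropped,
    and every host without "www." gets the prefix.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not host.startswith("www."):
        host = "www." + host
    if parts.port:
        host += f":{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path,
                       parts.query, parts.fragment))


print(canonical_url("http://INTERNET-SOFT.COM/DEMO"))
print(canonical_url("http://WWW.INTERNET-SOFT.COM/DEMO"))
```

Both calls yield the same string, so a set of already scanned addresses built from canonical URLs would store the page only once.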
Expand the nodes parents to make the node visible
- This convenience option controls how the tree of scanned websites is
displayed.
When it is selected, the tree expands the parent nodes of the page
currently being downloaded, so that you can see which branch of the site
the program is working on.
Identify browser as - This option sets
how the program identifies itself to the remote server when a website is
downloaded.
For example, when you download a page using Internet
Explorer 5.0, the remote server records the request, including the browser
name, in its log. The Extractor program is recorded in the same way when
it visits a website.
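In HTTP terms, a client identifies itself through the User-Agent request header. The sketch below shows how that header can be set with Python's `urllib.request`; the helper name `build_request` and the Internet Explorer 5.0 style string are example values only, not settings taken from the Extractor.

```python
import urllib.request


def build_request(url: str, browser_id: str) -> urllib.request.Request:
    """Create a request that announces itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": browser_id})


# Example identification string in the style of Internet Explorer 5.0.
IE5_STYLE = "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"

request = build_request("http://www.asona.org/", IE5_STYLE)
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```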
We would like to draw your attention to
the following:
Since the World Wide Web contains a huge number
of pages, downloading links and websites may require considerable processing
power and a large amount of disk space on your computer. A few hours
of work by the program may take up many gigabytes on your hard disk.
File Type Filter: Limiting the types
and sizes of files.
You can use this option to specify the types
of files you want to download and limit their size.
This is important, for example, when you only
want to download text documents without banners, pictures or archive files.
In this case, check the boxes beside html, htm,
txt, shtml and similar file types.
You can use these menu options to limit the size
of files to be downloaded. If you have selected "Load all file sizes",
files of all sizes will be downloaded. Otherwise you will only get the
sizes (specified in bytes) you have selected.
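A file type and size filter of this kind can be sketched as a single yes/no test per file. The function `passes_file_filter`, its parameters and the `TEXT_TYPES` set below are hypothetical; leaving both size limits unset plays the role of "Load all file sizes".

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Assumed set of "text document" extensions from the example above.
TEXT_TYPES = {"html", "htm", "txt", "shtml"}


def passes_file_filter(url, size_bytes, allowed_types=TEXT_TYPES,
                       min_size=None, max_size=None):
    """True if the file's extension is allowed and its size is in range."""
    ext = PurePosixPath(urlsplit(url).path).suffix.lstrip(".").lower()
    if ext not in allowed_types:
        return False
    if min_size is not None and size_bytes < min_size:
        return False
    if max_size is not None and size_bytes > max_size:
        return False
    return True


print(passes_file_filter("http://www.yahoo.com/index.html", 2048))
print(passes_file_filter("http://www.yahoo.com/banner.gif", 2048))
```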
URL / Domain Filter: Limitations by names
of directories, domain names and files.
You can restrict downloads by entering words
that the domain must contain. Let's say you want to download files only
from www.yahoo.com. You would enter just yahoo as the filter word.
The filter can be applied separately:
- to a word contained in the domain name;
- to the domain extension;
- to a word contained in a directory name;
- to a word contained in the file name.
The filter can be used to include or to exclude. If
you have entered words into the exclude filter, files whose URL contains
any of these words will not be downloaded.
If you opt for the include filter, only files whose names contain
the words specified in the filter will be downloaded.
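The include and exclude rules can be expressed as a simple substring test. The sketch below matches words against the whole URL for brevity, whereas the program lets you apply them to the domain, directory or file name separately; `passes_word_filter` is a hypothetical name.

```python
def passes_word_filter(url, include=(), exclude=()):
    """Apply include/exclude word lists to a URL (case-insensitive).

    A URL is rejected if it contains any exclude word. If include
    words are given, it must also contain at least one of them.
    """
    text = url.lower()
    if any(word.lower() in text for word in exclude):
        return False
    if include:
        return any(word.lower() in text for word in include)
    return True


print(passes_word_filter("http://www.yahoo.com/news/", include=["yahoo"]))
print(passes_word_filter("http://www.yahoo.com/ads/banner.gif",
                         include=["yahoo"], exclude=["ads"]))
```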
Domains: Limitations by domain type.
This option enables you to make limitations by
type and country of the domain.
To do this click on the requested domain type.
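Limiting by domain type comes down to looking at the last label of the host name. The sketch below is an assumed illustration of that test; `domain_type` and `allowed_domain_type` are hypothetical names and the default list of types is just an example.

```python
from urllib.parse import urlsplit


def domain_type(url: str) -> str:
    """Return the last label of the host name, e.g. "com" or "uk"."""
    host = (urlsplit(url).hostname or "").lower()
    return host.rsplit(".", 1)[-1] if "." in host else ""


def allowed_domain_type(url: str, allowed=("com", "net", "uk")) -> bool:
    """True if the URL's domain type is among the selected ones."""
    return domain_type(url) in allowed


print(domain_type("http://www.yahoo.com/"))
print(allowed_domain_type("http://vista.ru"))
```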
This is all you have to do for the main program settings. When you exit
the menu window you save by default the data you have entered and you can
proceed to download websites and e-mail addresses.
Now we can start a search project. The
default properties you have entered will automatically be called up when
you start a new project. These properties can be altered and saved for
a later time for each separate project.
The "search and download website" concept
is at the heart of this system. The term "project" therefore refers to
the total number of options that define which sites and properties are
to be downloaded.
WEBSITE DOWNLOADING
To start a new project, press "New".
An interface for defining the search and download (website) criteria will appear on your screen.
Then enter in the left window the list of websites (URLs) which you would like to download. By pressing the Load button you can load the list from a text file.
By pressing "Options" you enter the specific properties for your search project. Here you should check your directories and make sure you have enough disk space to download the required websites.
Search properties are entered the same way as default properties.
After you have assigned the search parameters, close this window and save the project by pressing "Save as". Then give the project file any name you choose.
For example, you could call the project file yahoo.pro.
Then proceed to download the data by pressing the "Download" button.
The properties that are most often changed are "Download files" and "Number of connections". These properties are conveniently located on the toolbar at the top of the search window, so you do not have to leave this window and enter data in another one.
After downloading data we recommend pressing "Save as" or "Save".
In this way you will be able to reuse these properties, if needed, at a later time.
To access existing projects, press "Open" and choose the name of the project.
You can re-download websites by pressing the "Download" button, or continue the download from the last site you visited. To do this, select the proper file in the "download tree" and press "Resume".
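If you prepare the address list for the Load button with a script, a plain text file is the natural format. The sketch below assumes one URL per line with blank lines ignored; that layout, like the name `load_url_list`, is an assumption, since the exact file format the program expects is not described here.

```python
from pathlib import Path


def load_url_list(path) -> list:
    """Read a text file and return its non-empty lines as URLs.

    Assumes one address per line; surrounding spaces are trimmed.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Reading the file back this way before loading it into the program is a quick check that it contains exactly the addresses you intended.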
The Extractor program is developed by Asona.