
Our modified regex-urlfilter.xml

last modified Mar 26, 2012 02:00 PM
This file was modified from the version included with the Nutch binary distribution. You will likely want to modify it for your own institution.

regex-urlfilter.xml (Extensible Markup Language (XML), 1 KB, 1824 bytes)

File contents

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
# (rule commented out: modified for Millennium crawling)
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# skip the plone search form
-.*search_form
# skip urls that end in a trailing slash.
-^http://libraries.mercer.edu/.*/$
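# accept anything else (commented out, so URLs that match no rule are ignored)
#+.
# skip library.mercer.edu URLs containing "request", "save", or "marc"
-^http://library.mercer.edu/.*request.*
-^http://library.mercer.edu/.*save.*
-^http://library.mercer.edu/.*marc.*
# accept library.mercer.edu URLs under search~S1 that contain "ftlist"
+^http://library.mercer.edu/search~S1.*ftlist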
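
The first-match rule described in the file's comments can be tried out with a short Python sketch before running a crawl. This is not Nutch code: the helper names `parse_rules` and `accepts` are hypothetical, the sample URLs in the usage lines are made up for illustration, and the sketch assumes a pattern may match anywhere in the URL unless it is anchored with `^` or `$`. Python's `re` syntax is close to, but not identical to, the Java regex syntax that Nutch uses, so treat the results as an approximation.

```python
import re

# The active (non-comment) rules from the file above, copied verbatim.
RULES_TEXT = r"""
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-.*search_form
-^http://libraries.mercer.edu/.*/$
-^http://library.mercer.edu/.*request.*
-^http://library.mercer.edu/.*save.*
-^http://library.mercer.edu/.*marc.*
+^http://library.mercer.edu/search~S1.*ftlist
"""


def parse_rules(text):
    """Turn filter lines into (include, compiled_pattern) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return the verdict of the first matching rule; ignore the URL if none match."""
    for include, pattern in rules:
        # Assumption: unanchored patterns may match anywhere in the URL.
        if pattern.search(url):
            return include
    return False


RULES = parse_rules(RULES_TEXT)

# Hypothetical URLs, for illustration only.
print(accepts("http://library.mercer.edu/search~S1/ftlist", RULES))
print(accepts("http://library.mercer.edu/search~S1/ftlist?save=1", RULES))
print(accepts("http://library.mercer.edu/", RULES))
```

Because the catch-all `+.` rule is commented out, only URLs that reach the final `+` rule without hitting an earlier `-` rule are kept. Order matters: the `save` rule sits above the `ftlist` rule, so the second sample URL is rejected even though it also matches the `ftlist` pattern.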
