linux_old1

Commit Graph

Author	SHA1	Message	Date
Eric W. Biederman	458878a705	nfsd: Convert nfs3xdr to use kuids and kgids When reading uids and gids off the wire convert them to kuids and kgids. When putting kuids and kgids onto the wire first convert them to uids and gids the other side will understand. Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:16:04 -08:00
Eric W. Biederman	e097258f2e	nfsd: Remove nfsd_luid, nfsd_lgid, nfsd_ruid and nfsd_rgid These trivial macros that don't currently do anything are the last vestiges of an old attempt at uid mapping that was removed from the kernel in September of 2002. Remove them to make it clear what the code is currently doing. Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:51 -08:00
Eric W. Biederman	65e10f6d0a	nfsd: Convert idmap to use kuids and kgids Convert nfsd_map_name_to_uid to return a kuid_t value. Convert nfsd_map_name_to_gid to return a kgid_t value. Convert nfsd_map_uid_to_name to take a kuid_t parameter. Convert nfsd_map_gid_to_name to take a kgid_t parameter. Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:49 -08:00
Eric W. Biederman	b5663898ec	nfsd: idmap use u32 not uid_t as the intermediate type u32 and uid_t have the same size and semantics so this change should have no operational effect. This just removes the WTF factor when looking at variables that hold both uids and gids whose type is uid_t. Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:37 -08:00
Eric W. Biederman	6c1810e040	nfsd: Remove declaration of nonexistent nfs4_acl_permisison Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:35 -08:00
Eric W. Biederman	9ff593c473	nfs: kuid and kgid conversions for nfs/inode.c - Use uid_eq and gid_eq when comparing kuids and kgids. - Use make_kuid(&init_user_ns, -2) and make_kgid(&init_user_ns, -2) as the initial uid and gid on nfs inodes, instead of using the typeunsafe value of -2. Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:33 -08:00
Eric W. Biederman	e5782076e7	nfs: Convert nfs4xdr to use kuids and kgids When reading uids and gids off the wire convert them to kuids and kgids. When putting kuids and kgids onto the wire first convert them to uids and gids the other side will understand. When printing kuids and kgids convert them to values in the initial user namespace then use normal printf formats. Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:32 -08:00
Eric W. Biederman	57a38dae2a	nfs: Convert nfs3xdr to use kuids and kgids When reading uids and gids off the wire convert them to kuids and kgids. When putting kuids and kgids onto the wire first convert them to uids and gids the other side will understand. Add an additional failure mode for incoming uids or gids that are invalid. Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:31 -08:00
Eric W. Biederman	cfa0898d4f	nfs: Convert nfs2xdr to use kuids and kgids When reading uids and gids off the wire convert them to kuids and kgids. When putting kuids and kgids onto the wire first convert them to uids and gids the other side will understand. Add an additional failure mode for incoming uids or gids that are invalid. Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:30 -08:00
Eric W. Biederman	9f309c86cf	nfs: Convert idmap to use kuids and kgids Convert nfs_map_name_to_uid to return a kuid_t value. Convert nfs_map_name_to_gid to return a kgid_t value. Convert nfs_map_uid_to_name to take a kuid_t parameter. Convert nfs_map_gid_to_name to take a kgid_t parameter. Tweak nfs_fattr_map_owner_to_name to use a kuid_t intermediate value. Tweak nfs_fattr_map_group_to_name to use a kgid_t intermediate value. Which makes these functions properly handle kuids and kgids, including erroring if the generated kuid or kgid is invalid. Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:29 -08:00
Eric W. Biederman	4e963d4f3e	nfs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring alloc Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:27 -08:00
Eric W. Biederman	ddca4e1730	nfs_common: Update the translation between nfsv3 acls and linux posix acls - Use kuid_t and kgid_t in struct nfsacl_encode_desc. - Convert from kuids and kgids when generating on the wire values. - Convert on the wire values to kuids and kgids when read. - Modify cmp_acl_entry to be type safe comparison on posix acls. Only acls with type ACL_USER and ACL_GROUP can appear more than once and as such need to compare more than their tag. - The e_id field is being removed from posix acls so don't initialize it. Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:14 -08:00
Eric W. Biederman	1ac7fd8190	ncpfs: Support interacting with multiple user namespaces ncpfs does not natively support uids and gids so this conversion was simply a matter of updating the type of the mounteduid, the uid and the gid on the superblock, fixing the ioctls that read them, and updating the mount option parser and the mount option printer. Cc: Petr Vandrovec <petr@vandrovec.name> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2013-02-13 06:15:13 -08:00
Eric W. Biederman	d054642642	gfs2: Convert uids and gids between dinodes and vfs inodes. When reading dinodes from the disk convert uids and gids into kuids and kgids to store in vfs data structures. When writing dinodes to the disk convert kuids and kgids in the in memory structures into plain uids and gids. For now all on disk data structures are assumed to be stored in the initial user namespace. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:11 -08:00
Eric W. Biederman	6b24c0d279	gfs2: Use uid_eq and gid_eq where appropriate Where kuid_t values are compared use uid_eq and where kgid_t values are compared use gid_eq. This is unfortunately necessary because of the type safety that keeps someone from accidentally mixing kuids and kgids with other types. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:10 -08:00
Eric W. Biederman	7c06b5d672	gfs2: Use kuid_t and kgid_t types where appropriate. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:09 -08:00
Eric W. Biederman	236c64e4b7	gfs2: Remove the QUOTA_USER and QUOTA_GROUP defines Remove the QUOTA_USER and QUOTA_GROUP defines. Remove the last vestigial users of QUOTA_USER and QUOTA_GROUP. Now that struct kqid is used throughout the gfs2 quota code there is no need to use QUOTA_USER and QUOTA_GROUP and the defines are just extraneous and confusing. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:08 -08:00
Eric W. Biederman	05e0a60d80	gfs2: Store qd_id in struct gfs2_quota_data as a struct kqid - Change qd_id in struct gfs2_quota_data to struct kqid. - Remove the now unnecessary QDF_USER bit field in qd_flags. - Propagate this change through the code generally making things simpler along the way. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:07 -08:00
Eric W. Biederman	ed87dabcc3	gfs2: Convert gfs2_quota_refresh to take a kqid - In quota_refresh_user_store convert the user supplied uid into a kqid and pass it to gfs2_quota_refresh. - In quota_refresh_group_store convert the user supplied gid into a kqid and pass it to gfs2_quota_refresh. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:06 -08:00
Eric W. Biederman	b59c8b6f9d	gfs2: Modify qdsb_get to take a struct kqid Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:05 -08:00
Eric W. Biederman	e08d8d7f20	gfs2: Modify struct gfs2_quota_change_host to use struct kqid Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:04 -08:00
Eric W. Biederman	2f6c9896f7	gfs2: Introduce qd2index Both qd_alloc and qd2offset perform the exact same computation to get an index from a gfs2_quota_data. Make life a little simpler and factor out this index computation. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:03 -08:00
Eric W. Biederman	558e85289f	gfs2: Report quotas in the caller's user namespace. When a quota is queried return the uid or the gid mapped into the caller's user namespace. In addition perform the munged version of the mapping so that instead of -1 a value that does not map is reported as the overflowuid or the overflowgid. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:02 -08:00
Eric W. Biederman	f4108a607f	gfs2: Split NO_QUOTA_CHANGE into NO_UID_QUOTA_CHANGE and NO_GID_QUOTA_CHANGE Split NO_QUOTA_CHANGE into NO_UID_QUOTA_CHANGE and NO_GID_QUOTA_CHANGE so the constants may be well typed. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:01 -08:00
Eric W. Biederman	393551e989	gfs2: Remove improper checks in gfs2_set_dqblk. In set_dqblk it is an error to look at fdq->d_id or fdq->d_flags. Userspace quota applications do not set these fields when calling quotactl(Q_XSETQLIM,...), and the kernel does not set those fields when quota_setquota calls set_dqblk. gfs2 never looks at fdq->d_id or fdq->d_flags after checking to see if they match the id and type supplied to set_dqblk. No other linux filesystem in set_dqblk looks at either fdq->d_id or fdq->d_flags. Therefore remove these bogus checks from gfs2 and allow normal quota setting applications to work. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:15:00 -08:00
Eric W. Biederman	488c8ef033	ocfs2: Compare kuids and kgids using uid_eq and gid_eq Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:01:00 -08:00
Eric W. Biederman	ba6135609c	ocfs2: For tracing report the uid and gid values in the initial user namespace Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:00:59 -08:00
Eric W. Biederman	2c03417627	ocfs2: Convert uid and gids between in core and on disk inodes Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:00:58 -08:00
Eric W. Biederman	03ab30f73d	ocfs2: convert between kuids and kgids and DLM locks Convert between uid and gids stored in the on the wire format of dlm locks aka struct ocfs2_meta_lvb and kuids and kgids stored in inode->i_uid and inode->i_gid. Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:00:57 -08:00
Eric W. Biederman	9522751cde	ocfs2: Handle kuids and kgids in acl/xattr conversions. Explicitly deal with the different kinds of acls because they need different conversions. Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:00:56 -08:00
Eric W. Biederman	17499e3329	coda: Cache permissions in struct coda_inode_info in a kuid_t. - Change c_uid in struct coda_inode_info from a vuid_t to a kuid_t. - Initialize c_uid to GLOBAL_ROOT_UID instead of 0. - Use uid_eq to compare cached kuids. Cc: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:00:54 -08:00
Eric W. Biederman	d83f5901bc	coda: Restrict coda messages to the initial user namespace Remove the slight chance that uids and gids in coda messages will be interpreted in the wrong user namespace. - Only allow processes in the initial user namespace to open the coda character device to communicate with coda filesystems. - Explicitly convert the uids in the coda header into the initial user namespace. - In coda_vattr_to_attr make kuids and kgids from the initial user namespace uids and gids in struct coda_vattr that just came from userspace. - In coda_iattr_to_vattr convert kuids and kgids into uids and gids in the initial user namespace and store them in struct coda_vattr for sending to coda userspace programs. Nothing needs to be changed with mounts as coda does not support being mounted in anything other than the initial user namespace. Cc: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:00:53 -08:00
Eric W. Biederman	9fd973e085	coda: Restrict coda messages to the initial pid namespace Remove the slight chance that pids in coda messages will be interpreted in the wrong pid namespace. - Explicitly send all pids in coda messages in the initial pid namespace. - Only allow mounts from processes in the initial pid namespace. - Only allow processes in the initial pid namespace to open the coda character device to communicate with coda. Cc: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:00:52 -08:00
Eric W. Biederman	a0a5386ac6	afs: Support interacting with multiple user namespaces Modify struct afs_file_status to store owner as a kuid_t and group as a kgid_t. In xdr_decode_AFSFetchStatus as owner is now a kuid_t and group is now a kgid_t don't use the EXTRACT macro. Instead perform the work of the extract macro explicitly. Read the value with ntohl and convert it to the appropriate type with make_kuid or make_kgid. Test if the value is different from what is stored in status and update changed. Update the value in status. In xdr_encode_AFS_StoreStatus call from_kuid or from_kgid as we are computing the on the wire encoding. Initialize uids with GLOBAL_ROOT_UID instead of 0. Initialize gids with GLOBAL_ROOT_GID instead of 0. Cc: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2013-02-13 06:00:51 -08:00
Eric W. Biederman	f74f70f8b1	afs: Only allow mounting afs in the initial network namespace rxrpc sockets only work in the initial network namespace so it isn't possible to support afs in any other network namespace. Cc: David Howells <dhowells@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-13 06:00:38 -08:00
Steven Whitehouse	fd95e81cb1	GFS2: Reinstate withdraw ack system This patch reinstates the ack system which withdraw should be using. It appears to have been accidentally forgotten when the lock module was merged into GFS2, due to two different sysfs files having the same name. Reported-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2013-02-13 12:21:40 +00:00
Josh Boyer	fb0af3f2b1	pstore: Create a convenient mount point for pstore Using /dev/pstore as a mount point for the pstore filesystem is slightly awkward. We don't normally mount filesystems in /dev/ and the /dev/pstore file isn't created automatically by anything. While this method will still work, we can create a persistent mount point in sysfs. This will put pstore on par with things like cgroups and efivarfs. Signed-off-by: Josh Boyer <jwboyer@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tony Luck <tony.luck@intel.com>	2013-02-12 13:07:22 -08:00
Eric W. Biederman	66fdb93f88	afs: Remove unused structure afs_store_status While looking for kuid_t and kgid_t conversions I found this structure that has never been used since it was added to the kernel in 2007. The obvious place for this structure to be used is in xdr_encode_AFS_StoreStatus and that function uses a small handful of local variables instead. So remove the unnecessary structure to prevent confusion. Cc: David Howells <dhowells@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-12 03:19:36 -08:00
Eric W. Biederman	d4ef4e3581	9p: Modify v9fs_get_fsgid_for_create to return a kgid Modify v9fs_get_fsgid_for_create to return a kgid and modify all of the variables that hold the result of v9fs_get_fsgid_for_create to be of type kgid_t. Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-12 03:19:34 -08:00
Eric W. Biederman	76ed23a5d7	9p: Modify struct v9fs_session_info to use kuids and kgids Change struct v9fs_session_info and the code that populates it to use kuids and kgids. When parsing the 9p mount options convert the dfltuid, dfltgid, and the session uid from the current user namespace into kuids and kgids. Modify V9FS_DEFUID and V9FS_DEFGUID to be kuid and kgid values. Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-12 03:19:33 -08:00
Eric W. Biederman	b464255699	9p: Modify struct 9p_fid to use a kuid_t not a uid_t Change struct 9p_fid and its associated functions to use kuid_t's instead of uid_t. Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-12 03:19:32 -08:00
Eric W. Biederman	447c50943f	9p: Modify the stat structures to use kuid_t and kgid_t 9p has three structures that can encode inode stat information. Modify all of those structures to contain kuid_t and kgid_t values. Modify the wire encoders and decoders of those structures to use 'u' and 'g' instead of 'd' in the format string where uids and gids are present. This results in all kuid and kgid conversion to and from on the wire values being performed by the same code in protocol.c where the client is known at the time of the conversion. Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2013-02-12 03:19:31 -08:00
Eric W. Biederman	f791f7c5e3	9p: Transmit kuid and kgid values Modify the p9_client_rpc format specifiers of every function that directly transmits a uid or a gid from 'd' to 'u' or 'g' as appropriate. Modify those same functions to take kuid_t and kgid_t parameters instead of uid_t and gid_t parameters. Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2013-02-12 03:19:30 -08:00
Eric W. Biederman	bd2bae6a66	ceph: Convert kuids and kgids before printing them. Before printing kuid and kgids values convert them into the initial user namespace. Cc: Sage Weil <sage@inktank.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-12 03:19:27 -08:00
Eric W. Biederman	ff3d004662	ceph: Convert struct ceph_mds_request to use kuid_t and kgid_t Hold the uid and gid for a pending ceph mds request using the types kuid_t and kgid_t. When a request message is finally created convert the kuid_t and kgid_t values into uids and gids in the initial user namespace. Cc: Sage Weil <sage@inktank.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-12 03:19:26 -08:00
Eric W. Biederman	ab871b903e	ceph: Translate inode uid and gid attributes to/from kuids and kgids. - In fill_inode() translate uids and gids in the initial user namespace into kuids and kgids stored in inode->i_uid and inode->i_gid. - In ceph_setattr() if they have changed convert inode->i_uid and inode->i_gid into initial user namespace uids and gids for transmission. Cc: Sage Weil <sage@inktank.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-12 03:19:25 -08:00
Eric W. Biederman	05cb11c17e	ceph: Translate between uid and gids in cap messages and kuids and kgids - Make the uid and gid arguments of send_cap_msg() used to compose ceph_mds_caps messages of type kuid_t and kgid_t. - Pass inode->i_uid and inode->i_gid in __send_cap to send_cap_msg() through variables of type kuid_t and kgid_t. - Modify struct ceph_cap_snap to store uids and gids in types kuid_t and kgid_t. This allows capturing inode->i_uid and inode->i_gid in ceph_queue_cap_snap() without loss and passing them to __ceph_flush_snaps() where they are removed from struct ceph_cap_snap and passed to send_cap_msg(). - In handle_cap_grant translate uid and gids in the initial user namespace stored in struct ceph_mds_cap into kuids and kgids before setting inode->i_uid and inode->i_gid. Cc: Sage Weil <sage@inktank.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-12 03:19:24 -08:00
Trond Myklebust	c8da19b986	NFSv4.1: Fix an ABBA locking issue with session and state serialisation Ensure that if nfs_wait_on_sequence() causes our rpc task to wait for an NFSv4 state serialisation lock, then we also drop the session slot. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org	2013-02-11 19:04:25 -05:00
Jaegeuk Kim	7dd690c820	f2fs: avoid build warning This patch removes the following build warning: fs/f2fs/node.c: warning: 'nofs' may be used uninitialized in this function [-Wuninitialized]: => 738:8 Note that this is a false alarm. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:28:55 +09:00
Jaegeuk Kim	90b2fc64f0	Merge branch 'f2fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into dev Pull f2fs cleanup patches from Al Viro: f2fs: get rid of fake on-stack dentries f2fs: switch init_inode_metadata() to passing parent and name separately f2fs: switch new_inode_page() from dentry to qstr f2fs: init_dent_inode() should take qstr Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com> Conflicts: fs/f2fs/recovery.c	2013-02-12 07:17:20 +09:00
Namjae Jeon	e975082411	f2fs: add compat_ioctl to provide backward compatibility Add compat_ioctl to provide support for backward compatibility - 32bit binary execution on 64bit kernel. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:02 +09:00
Jaegeuk Kim	b7250d2d84	f2fs: fix calculation of max. gc cost in the SSR case In the SSR case, the max gc cost should be the number of pages in a segment. Otherwise, f2fs is able to fail getting dirty segments frequently for SSR. In get_victim_by_default() previously, while(1) { ... cost = get_gc_cost(); <- cost is between 0 ~ 512. ... if (cost == get_max_cost(sbi, &p)) <- max cost is UINT_MAX due to GC_CB type continue; if (nsearched++ >= MAX_VICTIM_SEARCH) break; } So, if there are a number of fully valid segments in series, f2fs cannot skip those segments by comparing the cost and max cost of each segment. Note that, the cost is the number of valid blocks at the time of the last checkpoint. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:02 +09:00
Jaegeuk Kim	437275272f	f2fs: clarify and enhance the f2fs_gc flow This patch makes clearer the ambiguous f2fs_gc flow as follows. 1. Remove intermediate checkpoint condition during f2fs_gc (i.e., should_do_checkpoint() and GC_BLOCKED) 2. Remove unnecessary return values of f2fs_gc because of #1. (i.e., GC_NODE, GC_OK, etc) 3. Simplify write_checkpoint() because of #2. 4. Clarify the main f2fs_gc flow. o monitor how many freed sections during one iteration of do_garbage_collect(). o do GC more without checkpoints if we can't get enough free sections. o do checkpoint once we've got enough free sections through foreground GCs. 5. Adopt thread-logging (Slack-Space-Recycle) scheme more aggressively on data log types. See get_ssr_segment() Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:02 +09:00
Namjae Jeon	b1f1daf8c7	f2fs: optimize the return condition for has_not_enough_free_secs Instead of evaluating the free_sections and then deciding to return true/false from that path. We can directly use the evaluation condition for returning proper value. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:02 +09:00
Namjae Jeon	5ac206cf4f	f2fs: make an accessor to get sections for particular block type Introduce accessor to get the sections based upon the block type (node,dents...) and modify the functions : should_do_checkpoint, has_not_enough_free_secs to use this accessor function to get the node sections and dent sections. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:02 +09:00
Namjae Jeon	25718423ea	f2fs: mark gc_thread as NULL when thread creation is failed When gc thread creation is failed, mark gc_thread as NULL to avoid crash while trying to stop invalid thread in stop_gc_thread->kthread_stop. Instead make it return from: if (!gc_th) return; Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:02 +09:00
Namjae Jeon	ec7b1f2dd1	f2fs: name gc task as per the block device Currently GC task is started for each f2fs formatted/mounted device. But, when we check the task list, using 'ps', there is no distinguishing factor between the tasks. So, name the task as per the block device just like the flusher threads. Also, remove the macro GC_THREAD_NAME and instead use the name: f2fs_gc to avoid name length truncation, as the command length is 16 -> TASK_COMM_LEN 16 and example name like: f2fs_gc_task:8:16 -> this exceeds name length Before Patch for 2 F2FS formatted partitions: root 28061 0.0 0.0 0 0 ? S 10:31 0:00 [f2fs_gc_task] root 28087 0.0 0.0 0 0 ? S 10:32 0:00 [f2fs_gc_task] After Patch: root 16756 0.0 0.0 0 0 ? S 14:57 0:00 [f2fs_gc-8:18] root 16765 0.0 0.0 0 0 ? S 14:57 0:00 [f2fs_gc-8:19] Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:02 +09:00
Changman Lee	48600e44c1	f2fs: remove unnecessary gc option check and balance_fs 1. If f2fs is mounted with background_gc_off option, checking BG_GC is redundant. 2. f2fs_balance_fs is checked in f2fs_gc, so this is also redundant. Signed-off-by: Changman Lee <cm224.lee@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:02 +09:00
Changman Lee	94787d91cb	f2fs: remove repeated F2FS_SET_SB_DIRT call F2FS_SET_SB_DIRT is called in inc_page_count and it is directly called one more time in the next line. Signed-off-by: Changman Lee <cm224.lee@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:01 +09:00
majianpeng	14d7e9de05	f2fs: when check superblock failed, try to check another superblock In f2fs, there are two superblocks. So when the first superblock is invalid, it should try to check the other. By Jaegeuk Kim: o Remove a white space for coding style o Clean up for code readability o Fix a typo Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:01 +09:00
majianpeng	5c9b469295	f2fs: use F2FS_BLKSIZE to judge blocksize and page_cache_size On some systems PAGE_CACHE_SIZE isn't 4K. So use F2FS_BLKSIZE to judge. By Jaegeuk Kim: o f2fs does not support any page cache size other than 4KB. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:01 +09:00
majianpeng	f83759e283	f2fs: add device name in debugfs In file status, it can't distinguish between different devices. So add device name to do this function. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:01 +09:00
Changman Lee	facb020540	f2fs: stop repeated checking if cp is needed If it is decided that f2fs should do checkpoint, skip next comparison. Signed-off-by: Changman Lee <cm224.lee@samsung.com>	2013-02-12 07:15:01 +09:00
Jaegeuk Kim	d4686d56ec	f2fs: avoid balance_fs during evict_inode 1. Background Previously, if f2fs tries to move data blocks of an evicting inode during the cleaning process, it stops the process incompletely and then restarts the whole process, since it needs a locked inode to grab victim data pages in its address space. In order to get a locked inode, iget_locked() by f2fs_iget() is normally used, but it waits if the inode is being freed. So, here is a deadlock scenario. 1. f2fs_evict_inode() <- inode "A" 2. f2fs_balance_fs() 3. f2fs_gc() 4. gc_data_segment() 5. f2fs_iget() <- inode "A" too! If step #1 and #5 treat the same inode "A", step #5 would fall into deadlock since the inode "A" is being freed. In order to resolve this, f2fs_iget_nowait() which skips __wait_on_freeing_inode() was introduced in step #5, and stops f2fs_gc() to complete f2fs_evict_inode(). 1. f2fs_evict_inode() <- inode "A" 2. f2fs_balance_fs() 3. f2fs_gc() 4. gc_data_segment() 5. f2fs_iget_nowait() <- inode "A", then stop f2fs_gc() w/ -ENOENT 2. Problem and Solution In the above scenario, however, f2fs cannot finish f2fs_evict_inode() only if: o there are not enough free sections, and o f2fs_gc() tries to move data blocks of the evicting inode repeatedly. So, the final solution is to use f2fs_iget() and remove f2fs_balance_fs() in f2fs_evict_inode(). The f2fs_evict_inode() actually truncates all the data and node blocks, which means that it doesn't produce any dirty node pages accordingly. So, we don't need to do f2fs_balance_fs() in practice. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:01 +09:00
Jaegeuk Kim	369a708c2a	f2fs: remove the use of page_cache_release Let's remove the use of page_cache_release() in f2fs, and instead, use f2fs_put_page(page, 0) which is exactly same but for code readability. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:01 +09:00
Namjae Jeon	324ddc702e	f2fs: fix typo mistake for data_version description In f2fs_inode_info structure, the description for data_version has a typo mistake. It should be latest instead of lastes. So, correcting that. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:01 +09:00
Namjae Jeon	a2b52a598a	f2fs: reorganize code for ra_node_page We can remove unneeded label unlock_out, avoid unnecessary jump and reorganize the returning conditions in this function. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:01 +09:00
Namjae Jeon	3786dfdf4f	f2fs: avoid redundant call to has_not_enough_free_secs in f2fs_gc After doing a write_checkpoint from garbage collection path if there is still need to do more garbage collection, gc_more label is used to jump and start the process again. And in that process, first step before getting victim is to check if there are not enough free sections, which is already done before doing a jump to gc_more. We can avoid the redundant call to check free sections, by checking the gc_type flag which will remain FG_GC(value 1) under this condition. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:00 +09:00
Changman Lee	d6212a5f18	f2fs: add un/freeze_fs into super_operations This patch supports ioctl FIFREEZE and FITHAW to snapshot filesystem. Before calling f2fs_freeze, all writers would be suspended and sync_fs would be completed. So f2fs has nothing more to do itself. Only the background gc operation should be skipped until unfreeze, because it generates dirty nodes and data. Signed-off-by: Changman Lee <cm224.lee@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:00 +09:00
majianpeng	a2617dc686	f2fs: clean up the add_orphan_inode func For the code > prev = list_entry(orphan->list.prev, typeof(*prev), list); if orphan->list.prev == head, it can't get the right prev. And we can use the parameter 'this' to add. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:00 +09:00
Alejandro Martinez Ruiz	aa43507f68	f2fs: fix disable_ext_identify option spelling There is a typo in the ->show_options function for disable_ext_identify. Fix it to match the spelling from the documentation. Signed-off-by: Alejandro Martinez Ruiz <alex@nowcomputing.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:00 +09:00
Jaegeuk Kim	bd43df021a	f2fs: cover global locks for reserve_new_block The fill_zero() from fallocate() calls get_new_data_page(), which calls reserve_new_block(). The reserve_new_block() should be covered by DATA_NEW, one of global locks. And also, before getting the lock, we should check free sections by calling f2fs_balance_fs(). If we break this rule, f2fs can face out-of-control free space management and also fall into an infinite loop, as in the following scenario. [f2fs_sync_fs()] [fallocate()] - write_checkpoint() - fill_zero() - block_operations() - get_new_data_page() : grab NODE_NEW - get_dnode_of_data() : get locked dirty node page - sync_node_pages() : try to grab NODE_NEW for data allocation : trylock and skip the dirty node page : call sync_node_pages() repeatedly in order to flush all the dirty node pages! In order to avoid this, we should grab another global lock such as DATA_NEW before calling get_new_data_page() in fill_zero(). Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:00 +09:00
Jaegeuk Kim	577e349514	f2fs: prevent checkpoint once any IO failure is detected This patch enhances the checkpoint routine to cope with IO errors. Basically f2fs detects IO errors from end_io_write, and the errors can occur during any of the data, node, and meta page writes. In the previous code, when an IO error occurred during writes, f2fs sets a flag, CP_ERROR_FLAG, in the raw checkpoint buffer which will be written to disk. Afterwards, write_checkpoint() will check the flag and remount f2fs in read-only (ro) mode. However, even once f2fs is remounted in ro mode, dirty checkpoint pages can still be written to disk by the flusher or kswapd in the background. In such a case, after cold reboot, f2fs would restore the checkpoint data having CP_ERROR_FLAG, resulting in disabling write_checkpoint and remounting f2fs in ro mode again. Therefore, let's prevent any checkpoint page (meta) writes once an IO error has occurred, and remount f2fs in ro mode right away at that moment. Reported-by: Oliver Winker <oliver@oli1170.net> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com> Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>	2013-02-12 07:15:00 +09:00
Changman Lee	7d79e75f64	f2fs: save device node number into f2fs_inode This patch stores inode->i_rdev into on-disk inode structure. Alun reported that: aspire tmp # mount -t f2fs /dev/sdb mnt aspire tmp # mknod mnt/sda1 b 8 1 aspire tmp # mknod mnt/null c 1 3 aspire tmp # mknod mnt/console c 5 1 aspire tmp # ls -l mnt total 2 crw-r--r-- 1 root root 5, 1 Jan 22 18:44 console crw-r--r-- 1 root root 1, 3 Jan 22 18:44 null brw-r--r-- 1 root root 8, 1 Jan 22 18:44 sda1 aspire tmp # umount mnt aspire tmp # mount -t f2fs /dev/sdb mnt aspire tmp # ls -l mnt total 2 crw-r--r-- 1 root root 0, 0 Jan 22 18:44 console crw-r--r-- 1 root root 0, 0 Jan 22 18:44 null brw-r--r-- 1 root root 0, 0 Jan 22 18:44 sda1 In this report, f2fs lost the major/minor numbers of device files after umount. The reason was revealed that f2fs does not store the inode->i_rdev to the on-disk inode data structure. So, as the other file systems do, f2fs also stores i_rdev into the i_addr fields in on-disk inode structure without any on-disk layout changes. Note that, this bug is limited to device files made by mknod(). Reported-and-Tested-by: Alun Jones <alun.linux@ty-penguin.org.uk> Signed-off-by: Changman Lee <cm224.lee@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-02-12 07:15:00 +09:00
Fengguang Wu	e56a316214	nfsd4: free_stid can be static Reported-by: Fengguang Wu <fengguang.wu@intel.com>	2013-02-11 16:22:50 -05:00
Trond Myklebust	c21443c2c7	NFSv4: Fix a reboot recovery race when opening a file If the server reboots after it has replied to our OPEN, but before we call nfs4_opendata_to_nfs4_state(), then the reboot recovery thread will not see a stateid for this open, and so will fail to recover it. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-02-11 15:33:14 -05:00
Trond Myklebust	65b62a29f7	NFSv4: Ensure delegation recall and byte range lock removal don't conflict Add a mutex to the struct nfs4_state_owner to ensure that delegation recall doesn't conflict with byte range lock removal. Note that we nest the new mutex _outside_ the state manager reclaim protection (nfsi->rwsem) in order to avoid deadlocks. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-02-11 15:33:13 -05:00
Trond Myklebust	37380e4264	NFSv4: Fix up the return values of nfs4_open_delegation_recall Adjust the return values so that they return EAGAIN to the caller in cases where we might want to retry the delegation recall after the state recovery has run. Note that we can't wait and retry in this routine, because the caller may be the state manager thread. If delegation recall fails due to a session or reboot related issue, also ensure that we mark the stateid as delegated so that nfs_delegation_claim_opens can find it again later. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-02-11 15:33:13 -05:00
Trond Myklebust	d25be546a8	NFSv4.1: Don't lose locks when a server reboots during delegation return If the server reboots while we are converting a delegation into OPEN/LOCK stateids as part of a delegation return, the current code will simply exit with an error. This causes us to lose both delegation state and locking state (i.e. locking atomicity). Deal with this by exposing the delegation stateid during delegation return, so that we can recover the delegation, and then resume open/lock recovery. Note that not having to hold the nfs_inode->rwsem across the calls to nfs_delegation_claim_opens() also fixes a deadlock against the NFSv4.1 reboot recovery code. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-02-11 15:33:12 -05:00
Trond Myklebust	9a99af494b	NFSv4.1: Prevent deadlocks between state recovery and file locking We currently have a deadlock in which the state recovery thread ends up blocking due to one of the locks which it is trying to recover holding the nfs_inode->rwsem. The situation is as follows: the state recovery thread is scheduled in order to recover from a reboot. It immediately drains the session, forcing all ordinary NFSv4.1 calls to nfs41_setup_sequence() to be put to sleep. This includes the file locking process that holds the nfs_inode->rwsem. When the thread gets to nfs4_reclaim_locks(), it tries to grab a write lock on nfs_inode->rwsem, and boom... Fix is to have the lock drop the nfs_inode->rwsem while it is doing RPC calls. We use a sequence lock in order to signal to the locking process whether or not a state recovery thread has run on that inode, in which case it should retry the lock. Reported-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-02-11 15:33:12 -05:00
Trond Myklebust	c137afabe3	NFSv4: Allow the state manager to mark an open_owner as being recovered This patch adds a seqcount_t lock for use by the state manager to signal that an open owner has been recovered. This mechanism will be used by the delegation, open and byte range lock code in order to figure out if they need to replay requests due to collisions with lock recovery. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-02-11 15:33:11 -05:00
Rafael J. Wysocki	a9834cb205	Merge branch 'acpi-pm' * acpi-pm: (35 commits) ACPI / PM: Handle missing _PSC in acpi_bus_update_power() ACPI / PM: Do not power manage devices in unknown initial states ACPI / PM: Fix acpi_bus_get_device() check in drivers/acpi/device_pm.c ACPI / PM: Fix /proc/acpi/wakeup for devices w/o bus or parent ACPI / PM: Fix consistency check for power resources during resume ACPI / PM: Expose lists of device power resources to user space sysfs: Functions for adding/removing symlinks to/from attribute groups ACPI / PM: Expose current status of ACPI power resources ACPI / PM: Expose power states of ACPI devices to user space ACPI / scan: Prevent device add uevents from racing with user space ACPI / PM: Fix device power state value after transitions to D3cold ACPI / PM: Use string "D3cold" to represent ACPI_STATE_D3_COLD ACPI / PM: Sanitize checks in acpi_power_on_resources() ACPI / PM: Always evaluate _PSn after setting power resources ACPI / PM: Introduce helper for executing _PSn methods ACPI / PM: Make acpi_bus_init_power() more robust ACPI / PM: Fix build for unusual combination of Kconfig options ACPI / PM: remove leading whitespace from #ifdef ACPI / PM: Consolidate suspend-specific and hibernate-specific code ACPI / PM: Move device power management functions to device_pm.c ...	2013-02-11 13:20:56 +01:00
M. Mohan Kumar	b6f4bee02f	fs/9p: Fix atomic_open Return EEXIST if the requested file already exists; without this patch the open call will always succeed even if the file exists and the user specified O_CREAT|O_EXCL. The following test code can be used to verify this patch. Without this patch, executing the following test code on a 9p mount will always result in printing 'test case failed'. main() { int fd; /* first create the file */ fd = open("./file", O_CREAT|O_WRONLY); if (fd < 0) { perror("open"); return -1; } close(fd); /* Now opening same file with O_CREAT|O_EXCL should fail */ fd = open("./file", O_CREAT|O_EXCL); if (fd < 0 && errno == EEXIST) printf("test case pass\n"); else printf("test case failed\n"); close(fd); return 0; } Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2013-02-10 16:29:59 -06:00
Aneesh Kumar K.V	03f0e02273	fs/9p: Don't use O_TRUNC flag in TOPEN and TLOPEN request We do the truncate via setattr request, hence don't pass the O_TRUNC flag in open request. Without this patch we end up sending zero sized write request to server when we try to truncate. Some servers (VirtFS) were not handling that properly. Reported-by: M. Mohan Kumar <mohan@in.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2013-02-10 16:29:47 -06:00
Al Viro	7ffdea7ea3	locking in fs/9p ->readdir() ... is really excessive. First of all, ->readdir() is serialized by file->f_path.dentry->d_inode->i_mutex; playing with file->f_path.dentry->d_lock is not buying you anything. Moreover, rdir->mutex is pointless for exactly the same reason - you'll never see contention on it. While we are at it, there's no point in having rdir->buf a pointer - you have it point just past the end of rdir, so it might as well be a flex array (and no, it's not a gccism). Absolutely untested patch follows: Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2013-02-10 16:29:33 -06:00
Theodore Ts'o	b6e96d0067	jbd2: use module parameters instead of debugfs for jbd_debug There are multiple reasons to move away from debugfs. First of all, we are only using it for a single parameter, and it is much more complicated to set up (some 30 lines of code compared to 3), and one more thing that might fail while loading the jbd2 module. Secondly, as a module parameter it can be specified as a boot option if jbd2 is built into the kernel, or as a parameter when the module is loaded, and it can also be manipulated dynamically under /sys/module/jbd2/parameters/jbd2_debug. So it is more flexible. Ultimately we want to move away from using jbd_debug() towards tracepoints, but for now this is still a useful simplification of the code base. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-02-09 16:29:20 -05:00
Theodore Ts'o	a0b30c1229	ext4: use module parameters instead of debugfs for mballoc_debug There are multiple reasons to move away from debugfs. First of all, we are only using it for a single parameter, and it is much more complicated to set up (some 30 lines of code compared to 3), and one more thing that might fail while loading the ext4 module. Secondly, as a module parameter it can be specified as a boot option if ext4 is built into the kernel, or as a parameter when the module is loaded, and it can also be manipulated dynamically under /sys/module/ext4/parameters/mballoc_debug. So it is more flexible. Ultimately we want to move away from using mb_debug() towards tracepoints, but for now this is still a useful simplification of the code base. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-02-09 16:28:20 -05:00
Theodore Ts'o	1139575a92	ext4: start handle at the last possible moment when creating inodes In ext4_{create,mknod,mkdir,symlink}(), don't start the journal handle until the inode has been successfully allocated. In order to do this, we need to start the handle in the ext4_new_inode(). So create a new variant of this function, ext4_new_inode_start_handle(), so the handle can be created at the last possible minute, before we need to modify the inode allocation bitmap block. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-02-09 16:27:09 -05:00
Theodore Ts'o	95eaefbdec	ext4: fix the number of credits needed for acl ops with inline data Operations which modify extended attributes may need extra journal credits if inline data is used, since there is a chance that some extended attributes may need to get pushed to an external attribute block. Changes to reflect this were made in xattr.c, but they were missed in fs/ext4/acl.c. To fix this, abstract the calculation of the number of credits needed for xattr operations to an inline function defined in ext4_jbd2.h, and use it in acl.c and xattr.c. Also move the function declarations used in inline.c from xattr.h (where they are non-obviously hidden, and caused problems since ext4_jbd2.h needs to use the function ext4_has_inline_data), and move them to ext4.h. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Tao Ma <boyu.mt@taobao.com> Reviewed-by: Jan Kara <jack@suse.cz>	2013-02-09 15:23:03 -05:00
Theodore Ts'o	64044abf05	ext4: fix the number of credits needed for ext4_unlink() and ext4_rmdir() The ext4_unlink() and ext4_rmdir() don't actually release the blocks associated with the file/directory. This gets done in a separate jbd2 handle called via ext4_evict_inode(). Thus, we don't need to reserve lots of journal credits for the truncate. Note that using too many journal credits is non-optimal because it can lead to the journal transaction getting closed too early, before it is strictly necessary. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>	2013-02-09 15:06:24 -05:00
Theodore Ts'o	4b217630d0	ext4: fix the number of credits needed for ext4_ext_migrate() The migration ioctl creates a temporary inode. Since this inode is never linked to a directory, we don't need to reserve journal credits required for modifying the directory. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>	2013-02-09 12:50:27 -05:00
Theodore Ts'o	8dcfaad244	ext4: start handle at the last possible moment in ext4_rmdir() Don't start the jbd2 transaction handle until after the directory entry has been found, to minimize the amount of time that a handle is held active. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>	2013-02-09 09:45:11 -05:00
Theodore Ts'o	931b68649d	ext4: start handle at the last possible moment in ext4_unlink() Don't start the jbd2 transaction handle until after the directory entry has been found, to minimize the amount of time that a handle is held active. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>	2013-02-09 09:43:39 -05:00
Theodore Ts'o	47564bfb95	ext4: grab page before starting transaction handle in write_begin() The grab_cache_page_write_begin() function can potentially sleep for a long time, since it may need to do memory allocation which can block if the system is under significant memory pressure, and because it may be blocked on page writeback. If it does take a long time to grab the page, it's better that we not hold an active jbd2 handle. So grab a handle on the page first, and _then_ start the transaction handle. This commit fixes the following long transaction handle hold time: postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32 tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1 dirtied_blocks 0 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>	2013-02-09 09:24:14 -05:00
Theodore Ts'o	9924a92a8c	ext4: pass context information to jbd2__journal_start() So we can better understand what bits of ext4 are responsible for long-running jbd2 handles, use jbd2__journal_start() so we can pass context information for logging purposes. The recommended way for finding the longer-running handles is: T=/sys/kernel/debug/tracing EVENT=$T/events/jbd2/jbd2_handle_stats echo "interval > 5" > $EVENT/filter echo 1 > $EVENT/enable ./run-my-fs-benchmark cat $T/trace > /tmp/problem-handles This will list handles that were active for longer than 20ms. Having longer-running handles is bad, because a commit started at the wrong time could stall for those 20+ milliseconds, which could delay an fsync() or an O_SYNC operation. Here is an example line from the trace file describing a handle which lived on for 311 jiffies, or over 1.2 seconds: postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32 tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1 dirtied_blocks 0 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-02-08 21:59:22 -05:00
Jeff Layton	01a7decf75	nfsd: keep a checksum of the first 256 bytes of request Now that we're allowing more DRC entries, it becomes a lot easier to hit problems with XID collisions. In order to mitigate those, calculate a checksum of up to the first 256 bytes of each request coming in and store that in the cache entry, along with the total length of the request. This initially used crc32, but Chuck Lever and Jim Rees pointed out that crc32 is probably more heavyweight than we really need for generating these checksums, and recommended looking at using the same routines that are used to generate checksums for IP packets. On an x86_64 KVM guest measurements with ftrace showed ~800ns to use csum_partial vs ~1750ns for crc32. The difference probably isn't terribly significant, but for now we may as well use csum_partial. Signed-off-by: Jeff Layton <jlayton@redhat.com> Stones-thrown-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-08 16:02:26 -05:00
Theodore Ts'o	722887ddc8	ext4: move the jbd2 wrapper functions out of super.c Move the jbd2 wrapper functions which start and stop handles out of super.c, where they don't really logically belong, and into ext4_jbd2.c. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-02-08 13:00:31 -05:00
Theodore Ts'o	343d9c283c	jbd2: add tracepoints which provide per-handle statistics Handles which stay open a long time are problematic when it comes time to close down a transaction so it can be committed. These tracepoints will help us determine which ones are the problematic ones, and to validate whether changes make things better or worse. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-02-08 13:00:22 -05:00
Al Viro	b7f7a5e0be	f2fs: get rid of fake on-stack dentries those should never be used for a lot of reasons... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-02-08 02:55:04 -05:00
Al Viro	69f24eac55	f2fs: switch init_inode_metadata() to passing parent and name separately ... sure, it's tempting to just pass dentry. Except that we don't _have_ anything resembling a real dentry on one of the paths to it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-02-08 02:55:04 -05:00
Al Viro	c004363dd6	f2fs: switch new_inode_page() from dentry to qstr Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-02-08 02:55:03 -05:00
Al Viro	53dc9a6776	f2fs: init_dent_inode() should take qstr for one thing, it doesn't (and shouldn't) use anything else from dentry; for another, on some call chains the dentry is fake and should be eliminated completely. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-02-08 02:55:03 -05:00
Linus Torvalds	8d19514fad	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "We've got corner cases for updating i_size that ceph was hitting, error handling for quotas when we run out of space, a very subtle snapshot deletion race, a crash while removing devices, and one deadlock between subvolume creation and the sb_internal code (thanks lockdep)." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: move d_instantiate outside the transaction during mksubvol Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata Btrfs: fix possible stale data exposure Btrfs: fix missing i_size update Btrfs: fix race between snapshot deletion and getting inode Btrfs: fix missing release of the space/qgroup reservation in start_transaction() Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write() Btrfs: do not merge logged extents if we've removed them from the tree btrfs: don't try to notify udev about missing devices	2013-02-08 12:06:46 +11:00
Clark Williams	8bd75c77b7	sched/rt: Move rt specific bits into new header file Move rt scheduler definitions out of include/linux/sched.h into new file include/linux/sched/rt.h Signed-off-by: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-02-07 20:51:08 +01:00
Wang Shilong	712ddc52ff	Ext2: remove the static function release_blocks to optimize the kernel Because the static function 'release_blocks' is only called when releasing blocks, it is simpler and more efficient to call the function 'percpu_counter_add' directly. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz>	2013-02-07 16:44:56 +01:00
Wang Shilong	8e3dffc651	Ext2: mark inode dirty after the function dquot_free_block_nodirty is called We should mark inode dirty after the function dquot_free_block_nodirty is called. Besides, add a check for whether it is necessary to call the dquot_free_block_nodirty function. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz>	2013-02-07 16:44:55 +01:00
Alex Elder	311f08acde	xfs: memory barrier before wake_up_bit() In xfs_ifunlock() there is a call to wake_up_bit() after clearing the flush lock on the xfs inode. This is not guaranteed to be safe, as noted in the comments above wake_up_bit() beginning with: In order for this to function properly, as it uses waitqueue_active() internally, some kind of memory barrier must be done prior to calling this. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-07 09:39:48 -06:00
Eric Wong	634734b63a	fuse: allow control of adaptive readdirplus use For some filesystems (e.g. GlusterFS), the cost of performing a normal readdir and readdirplus are identical. Since adaptively using readdirplus has no benefit for those systems, give users/filesystems the option to control adaptive readdirplus use. v2 of this patch incorporates Miklos's suggestion to simplify the code, as well as improving consistency of macro names and documentation. Signed-off-by: Eric Wong <normalperson@yhbt.net> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-02-07 14:25:44 +01:00
Theodore Ts'o	9fff24aa2c	jbd2: track request delay statistics Track the delay between when we first request that the commit begin and when it actually begins, so we can see how much of a gap exists. In theory, this should just be the remaining scheduling quantum of the thread which requested the commit (assuming it was not a synchronous operation which triggered the commit request) plus scheduling overhead; however, it's possible that real time processes might keep the kjournald thread from executing. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-02-06 22:30:23 -05:00
Chris Mason	1a65e24b0b	Btrfs: move d_instantiate outside the transaction during mksubvol Dave Sterba triggered a lockdep complaint about lock ordering between the sb_internal lock and the cleaner semaphore. btrfs_lookup_dentry() checks for orphans if we're looking up the inode for a subvolume, and subvolume creation is triggering the lookup with a transaction running. This commit moves the d_instantiate after the transaction closes. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-06 12:11:10 -05:00
Jan Schmidt	eb6b88d92c	Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata When btrfs_qgroup_reserve returned a failure, we were missing a counter operation for BTRFS_I(inode)->outstanding_extents++, leading to warning messages about outstanding extents and space_info->bytes_may_use != 0. Additionally, the error handling code didn't take into account that we dropped the inode lock which might require more cleanup. Luckily, all the cleanup code we need is already there and can be shared with reserve_metadata_bytes, which is exactly what this patch does. Reported-by: Lev Vainblat <lev@zadarastorage.com> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-06 09:24:40 -05:00
Wang Shilong	98783e453c	Ext2: remove the overhead check about sb in the function ext2_new_blocks It can be guaranteed that inode->i_sb is not null in the vfs, so the check for it here is pure overhead. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz>	2013-02-06 13:47:02 +01:00
Chris Mason	24f8ebe918	Merge git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git for-chris into for-linus	2013-02-05 19:24:44 -05:00
Josef Bacik	59fe4f4197	Btrfs: fix possible stale data exposure We specifically do not update the disk i_size if there are ordered extents outstanding for any area between the current disk_i_size and our ordered extent so that we do not expose stale data. The problem is the check we have only checks if the ordered extent starts at or after the current disk_i_size, which doesn't take into account an ordered extent that starts before the current disk_i_size and ends past the disk_i_size. Fix this by checking if the extent ends past the disk_i_size. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-05 16:09:16 -05:00
Josef Bacik	5d1f40202b	Btrfs: fix missing i_size update If we have an ordered extent before the ordered extent we are currently completing that is after the current disk_i_size we will put our i_size update into that ordered extent so that we do not expose stale data. The problem is that if our disk i_size is updated past the previous ordered extent we won't update the i_size with the pending i_size update. So check the pending i_size update and if it's above the current disk i_size we need to go ahead and try to update. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-05 16:09:14 -05:00
Liu Bo	6f1c36055f	Btrfs: fix race between snapshot deletion and getting inode While running snapshot testscript created by Mitch and David, the race between autodefrag and snapshot deletion can lead to corruption of dead_root list so that we can get crash on btrfs_clean_old_snapshots(). And besides autodefrag, scrub also does the same thing, i.e. read root first and get inode. Here is the story (take autodefrag as an example): (1) when we delete a snapshot or subvolume, it will set its root's refs to zero and do an iput() on its own inode, and if this inode happens to be the only active in-memory one in root's inode rbtree, it will add itself to the global dead_roots list for later cleanup. (2) after (1), the autodefrag thread may read another inode for defrag and the inode is just in the deleted snapshot/subvolume, but all of these are without checking if the root is still valid (refs > 0). So the end result is adding the deleted snapshot/subvolume's root to the global dead_roots list AGAIN. Fortunately, we already have a srcu lock to avoid the race, i.e. subvol_srcu. So all we need to do is to take the lock to protect 'read root and get inode', since we synchronize to wait for the rcu grace period before adding something to the global dead_roots list. Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-05 16:09:13 -05:00
Miao Xie	843fcf3573	Btrfs: fix missing release of the space/qgroup reservation in start_transaction() When we fail to start a transaction, we need to release the reserved free space and qgroup space, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-05 16:09:11 -05:00
Miao Xie	0a3404dcff	Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write() If the checks at the beginning of btrfs_file_aio_write() fail, we needn't decrease ->sync_writers, because we have not increased it. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-05 16:09:10 -05:00
Josef Bacik	222c81dc38	Btrfs: do not merge logged extents if we've removed them from the tree You can run into this problem where if somebody is fsyncing and writing out the existing extents you will have removed the extent map from the em tree, but it's still valid for the current fsync so we go ahead and write it. The problem is we unconditionally try to merge it back into the em tree, but if we've removed it from the em tree that will cause use after free problems. Fix this to only merge if we are still a part of the tree. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-05 16:09:03 -05:00
Jan Kara	288be96de6	udf: Remove unused s_extLength from udf_bitmap s_extLength was assigned to but the value was never really used. So just remove the field. Signed-off-by: Jan Kara <jack@suse.cz>	2013-02-05 17:29:53 +01:00
Jan Kara	c60305b578	udf: Make s_block_bitmap standard array struct udf_bitmap has array of buffer pointers attached to it. The code unnecessarily used s_block_bitmap as a pointer to the array instead of the standard trick of using 0 length array in the declaration. Change that to make code more readable and actually shrink the structure by one pointer. Signed-off-by: Jan Kara <jack@suse.cz>	2013-02-05 17:29:52 +01:00
Jan Kara	89b1f39eb4	udf: Fix bitmap overflow on large filesystems with small block size For large UDF filesystems with 512-byte blocks the number of necessary bitmap blocks is larger than 2^16 so s_nr_groups in udf_bitmap overflows (the number will overflow for filesystems larger than 128 GB with 512-byte blocks). That results in ENOSPC errors even though the filesystem has plenty of free space. Fix the problem by changing s_nr_groups' type to 'int'. That is enough even for filesystems with 2^32 blocks (UDF maximum) and 512-byte blocksize. Reported-and-tested-by: v10lator@myway.de Signed-off-by: Jan Kara <jack@suse.cz>	2013-02-05 17:29:30 +01:00
Chris Mason	0e4e026366	Merge branch 'for-linus' into raid56-experimental Conflicts: fs/btrfs/volumes.c Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-05 10:04:03 -05:00
Chris Mason	1f0905ec15	Btrfs: remove conflicting check for minimum number of devices in raid56 The device removal code was incorrectly checking against two different limits for raid5 and raid6. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-05 10:01:42 -05:00
Tomasz Torcz	10e78e3a8a	Btrfs: select XOR_BLOCKS in Kconfig The Btrfs raid56 uses the generic xor helpers. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-05 09:55:30 -05:00
Jeff Layton	5976687a2b	sunrpc: move address copy/cmp/convert routines and prototypes from clnt.h to addr.h These routines are used by server and client code, so having them in a separate header would be best. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-05 09:41:14 -05:00
J. Bruce Fields	3abdb60712	nfsd4: simplify idr allocation We don't really need to preallocate at all; just allocate and initialize everything at once, but leave the sc_type field initially 0 to prevent finding the stateid till it's fully initialized. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-05 09:41:12 -05:00
majianpeng	2d32b29a1c	nfsd: Fix memleak When freeing an nfs client, we must also free ->cl_stateids. Cc: stable@kernel.org Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-05 09:40:47 -05:00
Ingo Molnar	b2c77a57e4	Merge tag 'full-dynticks-cputime-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into sched/core Pull full-dynticks (user-space execution is undisturbed and receives no timer IRQs) preparation changes that convert the cputime accounting code to be full-dynticks ready, from Frederic Weisbecker: "This implements the cputime accounting on full dynticks CPUs. Typical cputime stats infrastructure relies on the timer tick and its periodic polling on the CPU to account the amount of time spent by the CPUs and the tasks per high level domains such as userspace, kernelspace, guest, ... Now we are preparing to implement full dynticks capability on Linux for Real Time and HPC users who want full CPU isolation. This feature requires a cputime accounting that doesn't depend on the timer tick. To implement it, this new cputime infrastructure plugs into kernel/user/guest boundaries to take snapshots of cputime and flush these to the stats when needed. This performs pretty much like CONFIG_VIRT_CPU_ACCOUNTING except that context location and cputime snapshots are synchronized between write and read side such that the latter can safely retrieve the pending tickless cputime of a task and add it to its latest cputime snapshot to return the correct result to the user." Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-02-05 13:10:33 +01:00
Linus Torvalds	fe547d7714	Merge branch 'fix-max-write' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm fix from David Teigland: "Thanks to Jana who reported the problem and was able to test this fix so quickly." This fixes an incorrect size check that triggered for CONFIG_COMPAT whether the code was actually doing compat or not. The incorrect write size check broke userland (clvmd) when maximum resource name lengths are used. * 'fix-max-write' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: check the write size from user	2013-02-05 20:50:11 +11:00
Vyacheslav Dubeyko	a9bae18954	nilfs2: fix very long mount time issue There exists a situation when GC can work in background alone without any other filesystem activity during significant time. The nilfs_clean_segments() method calls nilfs_segctor_construct() that updates superblocks in the case of NILFS_SC_SUPER_ROOT and THE_NILFS_DISCONTINUED flags are set. But when GC is working alone the nilfs_clean_segments() is called with unset THE_NILFS_DISCONTINUED flag. As a result, the superblocks are not updated all this time and in the case of SPOR superblocks keep very old values of last super root placement. SYMPTOMS: Trying to mount a NILFS2 volume after SPOR in such environment ends with very long mounting time (it can achieve about several hours in some cases). REPRODUCING PATH: 1. It needs to use external USB HDD, disable automount and doesn't make any additional filesystem activity on the NILFS2 volume. 2. Generate temporary file with size about 100 - 500 GB (for example, dd if=/dev/zero of=<file_name> bs=1073741824 count=200). The size of file defines duration of GC working. 3. Then it needs to delete file. 4. Start GC manually by means of command "nilfs-clean -p 0". When you start GC by means of such way then, at the end, superblocks are updated only once. So, for simulation of SPOR, it needs to wait sometime (15 - 40 minutes) and simply switch off USB HDD manually. 5. Switch on USB HDD again and try to mount NILFS2 volume. As a result, NILFS2 volume will mount during very long time. REPRODUCIBILITY: 100% FIX: This patch adds checking that superblocks need to update and set THE_NILFS_DISCONTINUED flag before nilfs_clean_segments() call. Reported-by: Sergey Alexandrov <splavgm@gmail.com> Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Tested-by: Vyacheslav Dubeyko <slava@dubeyko.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-05 20:38:46 +11:00
Jeff Layton	b4e7f2c945	nfsd: register a shrinker for DRC cache entries Since we dynamically allocate them now, allow the system to call us up to release them if it gets low on memory. Since these entries aren't replaceable, only free ones that are expired or that are over the cap. The seeks value is set to '1' however to indicate that freeing these entries is low-cost. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-04 17:19:13 -05:00
Jeff Layton	aca8a23de6	nfsd: add recurring workqueue job to clean the cache It's not sufficient to only clean the cache when requests come in. What if we have a flurry of activity and then the server goes idle? Add a workqueue job that will clean the cache every RC_EXPIRE period. Care is taken to only run this when we expect to have entries expiring. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-04 17:19:12 -05:00
Jeff Layton	2c6b691c05	nfsd: when updating an entry with RC_NOCACHE, just free it There's no need to keep entries around that we're declaring RC_NOCACHE. Ditto if there's a problem with the entry. With this change too, there's no need to test for RC_UNUSED in the search function. If the entry's in the hash table then it's either INPROG or DONE. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-04 17:19:11 -05:00
Jeff Layton	13cc8a78e8	nfsd: remove the cache_disabled flag With the change to dynamically allocate entries, the cache is never disabled on the fly. Remove this flag. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-04 17:19:11 -05:00
Jeff Layton	0338dd1572	nfsd: dynamically allocate DRC entries The existing code keeps a fixed-size cache of 1024 entries. This is much too small for a busy server, and wastes memory on an idle one. This patch changes the code to dynamically allocate and free these cache entries. A cap on the number of entries is retained, but it's much larger than the existing value and now scales with the amount of low memory in the machine. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-04 17:19:10 -05:00
Jeff Layton	0ee0bf7ee5	nfsd: track the number of DRC entries in the cache Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-04 17:19:09 -05:00
Jeff Layton	56c2548b2d	nfsd: always move DRC entries to the end of LRU list when updating timestamp ...otherwise, we end up with the list ordering wrong. Currently, it's not a problem since we skip RC_INPROG entries, but keeping the ordering strict will be necessary for a later patch that adds a cache cleaner. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-04 17:19:09 -05:00
David Teigland	d4b0bcf32b	dlm: check the write size from user Return EINVAL from write if the size is larger than allowed. Do this before allocating kernel memory for the bogus size, which could lead to OOM. Reported-by: Sasha Levin <levinsasha928@gmail.com> Tested-by: Jana Saout <jana@saout.de> Signed-off-by: David Teigland <teigland@redhat.com>	2013-02-04 15:31:22 -06:00
Theodore Ts'o	40ae348762	ext4: optimize mballoc for large allocations The ext4 block allocator only maintains buddy bitmaps for chunks which are less than or equal to one quarter of a block group. That is, for a file system with a 1k blocksize, and where the number of blocks in a block group is 8192 blocks, the largest chunk size tracked by buddy bitmaps is 2048 blocks. For a file system with a 4k blocksize, and where the number of blocks in a block group is 32768 blocks, the largest chunk size tracked by buddy bitmaps is 8192 blocks. To work around this code, mballoc.c before this commit would truncate allocation requests to the number of blocks in a block group minus 10. Why 10? Aside from being a completely arbitrary number, it avoids block allocation to be a power of two larger than 25% of the block group. If you try to explicitly fallocate 50% of the block group size, this will demonstrate the problem; the block allocation code will scan all of the blocks in the file system with cr==0 (since the request is for a natural power of two), but then completely fail for all block groups, since the buddy bitmaps don't track chunk sizes of 50% of the block group. To fix this, in these cases we use ext4_mb_complex_scan_group() instead of ext4_mb_simple_scan_group(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger@dilger.ca>	2013-02-04 15:08:40 -05:00
Enke Chen	0415d29102	fuse: send poll events commit `626cf23660` "poll: add poll_requested_events()..." enabled us to send the requested events to the filesystem. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-02-04 16:14:32 +01:00
Miklos Szeredi	dfca7cebc2	fuse: don't WARN when nlink is zero drop_nlink() warns if nlink is already zero. This is triggerable by a buggy userspace filesystem. The cure, I think, is worse than the disease so disable the warning. Reported-by: Tero Roponen <tero.roponen@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-02-04 15:57:42 +01:00
Eric Wong	6a4e922c3d	fuse: avoid out-of-scope stack access All pointers within fuse_req must point to valid memory once fuse_force_forget() returns. This bug appeared in "fuse: implement NFS-like readdirplus support" and was never in any official Linux release. I tested the fuse_force_forget() code path by injecting a fake -ENOMEM and verified the FORGET operation was called properly in userspace. Signed-off-by: Eric Wong <normalperson@yhbt.net> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-02-04 15:22:23 +01:00
Jeff Layton	2eeb9b2abc	nfsd: initialize the exp->ex_uuid field in svc_export_init commit `885c91f746` in Bruce's tree was causing oopses for me: general protection fault: 0000 [#1] SMP Modules linked in: nfsd(OF) nfs_acl(OF) auth_rpcgss(OF) lockd(OF) sunrpc(OF) kvm_amd kvm microcode i2c_piix4 virtio_net virtio_balloon cirrus drm_kms_helper ttm drm virtio_blk i2c_core CPU 0 Pid: 564, comm: exportfs Tainted: GF O 3.8.0-0.rc5.git2.1.fc19.x86_64 #1 Bochs Bochs RIP: 0010:[<ffffffff811b1509>] [<ffffffff811b1509>] kfree+0x49/0x280 RSP: 0018:ffff88007a3d7c50 EFLAGS: 00010203 RAX: 01adaf8dadadad80 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000001 RDX: ffffffff7fffffff RSI: 0000000000000000 RDI: 6b6b6b6b6b6b6b6b RBP: ffff88007a3d7c80 R08: 6b6b6b6b6b6b6b6b R09: 0000000000000000 R10: 0000000000000018 R11: 0000000000000000 R12: ffff88006a117b50 R13: ffffffffa01a589c R14: ffff8800631b0f50 R15: 01ad998dadadad80 FS: 00007fcaa3616740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f5d84b6fdd8 CR3: 0000000064db4000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process exportfs (pid: 564, threadinfo ffff88007a3d6000, task ffff88006af28000) Stack: ffff88007a3d7c80 ffff88006a117b68 ffff88006a117b50 0000000000000000 ffff8800631b0f50 ffff88006a117b50 ffff88007a3d7ca0 ffffffffa01a589c ffff880036be1148 ffff88007a3d7cf8 ffff88007a3d7e28 ffffffffa01a6a98 Call Trace: [<ffffffffa01a589c>] svc_export_put+0x5c/0x70 [nfsd] [<ffffffffa01a6a98>] svc_export_parse+0x328/0x7e0 [nfsd] [<ffffffffa016f1c7>] cache_do_downcall+0x57/0x70 [sunrpc] [<ffffffffa016f25e>] cache_downcall+0x7e/0x100 [sunrpc] [<ffffffffa016f338>] cache_write_procfs+0x58/0x90 [sunrpc] [<ffffffffa016f2e0>] ? cache_downcall+0x100/0x100 [sunrpc] [<ffffffff8123b0e5>] proc_reg_write+0x75/0xb0 [<ffffffff811ccecf>] vfs_write+0x9f/0x170 [<ffffffff811cd089>] sys_write+0x49/0xa0 [<ffffffff816e0919>] system_call_fastpath+0x16/0x1b Code: 66 66 66 90 48 83 fb 10 0f 86 c3 00 00 00 48 89 df 49 bf 00 00 00 00 00 ea ff ff e8 f2 12 ea ff 48 c1 e8 0c 48 c1 e0 06 49 01 c7 <49> 8b 07 f6 c4 80 0f 85 1d 02 00 00 49 8b 07 a8 80 0f 84 ee 01 RIP [<ffffffff811b1509>] kfree+0x49/0x280 RSP <ffff88007a3d7c50> I think Majianpeng's patch is correct, but incomplete. In order for it to be safe to free the ex_uuid unconditionally in svc_export_put, we need to make sure it's initialized to NULL in the init routine. Cc: majianpeng <majianpeng@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-04 09:16:24 -05:00
Jeff Layton	a4a3ec3291	nfsd: break out hashtable search into separate function Later, we'll need more than one call site for this, so break it out into a new function. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-04 09:16:24 -05:00
Jeff Layton	d1a0774de6	nfsd: clean up and clarify the cache expiration code Add a preprocessor constant for the expiry time of cache entries, and move the test for an expired entry into a function. Note that the current code does not test for RC_INPROG. It just assumes that it won't take more than 2 minutes to fill out an in-progress entry. I'm not sure how valid that assumption is though, so let's just ensure that we never consider an RC_INPROG entry to be expired. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-04 09:16:23 -05:00
Jeff Layton	25e6b8b0e1	nfsd: remove redundant test from nfsd_reply_cache_free Entries can only get a c_type of RC_REPLBUFF iff they are RC_DONE. Therefore the test for RC_DONE isn't necessary here. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-04 09:16:22 -05:00
Jeff Layton	f09841fdfa	nfsd: add alloc and free functions for DRC entries Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-04 09:16:22 -05:00
Jeff Layton	8a8bc40d9b	nfsd: create a dedicated slabcache for DRC entries Currently we use kmalloc() which wastes a little bit of memory on each allocation since it's a power of 2 allocator. Since we're allocating 1024 of these now, and may need even more later, let's create a new slabcache for them. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-04 09:16:21 -05:00
Jeff Layton	09662d58d5	nfsd: get rid of RC_INTR The reply cache code never returns this status. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-04 09:16:20 -05:00
Jeff Layton	6dc8889589	nfsd: remove unneeded spinlock in nfsd_cache_update The locking rules for cache entries say that locking the cache_lock isn't needed if you're just touching the current entry. Earlier in this function we set rp->c_state to RC_UNUSED without any locking, so I believe it's ok to do the same here. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-04 09:16:19 -05:00
Jeff Layton	7b9e8522a6	nfsd: fix IPv6 address handling in the DRC Currently, it only stores the first 16 bytes of any address. struct sockaddr_in6 is 28 bytes however, so we're currently ignoring the last 12 bytes of the address. Expand the c_addr field to a sockaddr_in6, and cast it to a sockaddr_in as necessary. Also fix the comparator to use the existing RPC helpers for this. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-02-04 09:16:19 -05:00
Adam Thomas	8afd500cb5	UBIFS: fix double free of ubifs_orphan objects The last orphan in the dnext list has its dnext set to NULL. Because of that, ubifs_delete_orphan assumes that it is not on the dnext list and frees it immediately instead of ignoring it as a second delete. The orphan is later freed again by erase_deleted. This change adds an explicit flag to ubifs_orphan indicating whether it is pending delete. Signed-off-by: Adam Thomas <adamthomas1111@gmail.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: stable@vger.kernel.org	2013-02-04 12:31:48 +02:00
Adam Thomas	2928f0d0c5	UBIFS: fix use of freed ubifs_orphan objects The last orphan in the cnext list has its cnext set to NULL. Because of that, ubifs_delete_orphan assumes that it is not on the cnext list and frees it immediately instead of adding it to the dnext list. The freed orphan is later modified by write_orph_node. This can cause various inconsistencies including directory entries that cannot be removed and this error: UBIFS error (pid 20685): layout_cnodes: LPT out of space at LEB 14:129009 needing 17, done_ltab 1, done_lsave 1 This is a regression introduced by "7074e5eb UBIFS: remove invalid reference to list iterator variable". This change adds an explicit flag to ubifs_orphan indicating whether it is pending commit. Signed-off-by: Adam Thomas <adamthomas1111@gmail.com> Reviewed-by: Adrian Hunter <adrian.hunter@intel.com> Cc: stable@vger.kernel.org # v3.6+ Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>	2013-02-04 12:31:00 +02:00
Thomas Gleixner	90889a635a	Merge branch 'fortglx/3.9/time' of git://git.linaro.org/people/jstultz/linux into timers/core Trivial conflict in arch/x86/Kconfig Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2013-02-04 11:03:03 +01:00
Al Viro	9d94b9e2f3	switch timerfd compat syscalls to COMPAT_SYSCALL_DEFINE ... and move them over to fs/timerfd.c. Cleaner and easier that way... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-02-03 15:09:25 -05:00
Al Viro	f482e1b4a4	switch compat_sys_open* to COMPAT_SYSCALL_DEFINE	2013-02-03 15:09:24 -05:00
Theodore Ts'o	8dc0aa8cf0	ext4: check incompatible mount options while mounting ext2/3 Check for incompatible mount options when using the ext4 file system driver to mount ext2 or ext3 file systems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-02-02 23:38:39 -05:00
Jan Kara	e33e60eaed	ext4: print error when argument of inode_readahead_blk is invalid If argument of inode_readahead_blk is too big, we just bail out without printing any error. Fix this since it could confuse users. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-02-02 23:14:31 -05:00
Jan Kara	5f3633e36b	ext4: make mount option parsing loop more logical The loop looking for correct mount option entry is more logical if it is rewritten as an empty loop looking for correct option entry and then code handling the option. It also saves one level of indentation for a lot of code so we can join a couple of split lines. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-02-02 23:09:36 -05:00
Jan Kara	0efb3b2300	ext4: move several mount options to standard handling loop Several mount options (resuid, resgid, journal_dev, journal_ioprio) are currently handled before we enter the standard option handling loop. I don't see a reason for this so move them to the normal handling loop to make things more regular. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-02-02 22:52:19 -05:00
Cong Ding	0e79537d30	ext4: reduce one "if" comparison in ext4_dirhash() It is unnecessary to check i<4 after the loop; just do it before the break. Signed-off-by: Cong Ding <dinggnu@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-02-01 22:33:21 -05:00
Niu Yawei	f116700971	ext4: fix race in ext4_mb_add_n_trim() In ext4_mb_add_n_trim(), lg_prealloc_lock should be taken when changing the lg_prealloc_list. Signed-off-by: Niu Yawei <yawei.niu@intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2013-02-01 21:31:27 -05:00
Akria Fujita	87e698734b	ext4: fix smatch warning in move_extent.c's mext_replace_branches() Commit `2147b1a6a4` resulted in a new smatch warning: > fs/ext4/move_extent.c:693 mext_replace_branches() > warn: variable dereferenced before check 'dext' (see line 683) Fix this by adding a check to make sure dext is non-NULL before we dereference it. Signed-off-by: Akria Fujita <a-fujita@rs.jp.nec.com> [ modified by tytso to make sure an ext4_error is called ] Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-02-01 20:52:46 -05:00
Julia Lawall	524c19ebc9	ext4: use WARN in ext4_alloc_blocks Use WARN rather than printk followed by WARN_ON(1), for conciseness. A simplified version of the semantic patch that makes this transformation is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression list es; @@ -printk( +WARN(1, es); -WARN_ON(1); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-02-01 20:07:21 -05:00
Jeff Liu	a21cd50367	xfs: refactor space log reservation for XFS_TRANS_ATTR_SET Currently, we calculate the attribute set transaction log space reservation at runtime in two parts: 1) XFS_ATTRSET_LOG_RES() which is calculated at mount time. 2) ((ext * (mp)->m_sb.sb_sectsize) + \ (ext * XFS_FSB_TO_B((mp), XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))) + \ (128 * (ext + (ext * XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK)))))) which is calculated at runtime since it depends on the given extent length in blocks. This patch renames XFS_ATTRSET_LOG_RES(mp) to XFS_ATTRSETM_LOG_RES(mp) to indicate that it is figured out at mount time. Introduce XFS_ATTRSETRT_LOG_RES(mp) which would be used to calculate the unit of the log space reservation for one block. In this way, the total runtime space for the given extent length can be figured out by: XFS_ATTRSETM_LOG_RES(mp) + XFS_ATTRSETRT_LOG_RES(mp) * ext Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:56:31 -06:00
Jeff Liu	762c585b18	xfs: make use of XFS_SB_LOG_RES() at xfs_fs_log_dummy() Make use of XFS_SB_LOG_RES() at xfs_fs_log_dummy(). Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:55:59 -06:00
Jeff Liu	5166ab0655	xfs: make use of XFS_SB_LOG_RES() at xfs_mount_log_sb() Make use of XFS_SB_LOG_RES() at xfs_mount_log_sb(). Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:55:08 -06:00
Jeff Liu	e457274b60	xfs: make use of XFS_SB_LOG_RES() at xfs_log_sbcount() Make use of XFS_SB_LOG_RES() at xfs_log_sbcount(). Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:47:18 -06:00
Jeff Liu	a7bd794a0f	xfs: introduce XFS_SB_LOG_RES() for transactions that modify sb on disk Introduce a new transaction space reservation XFS_SB_LOG_RES() for those transactions that need to modify the superblock on disk. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:46:35 -06:00
Jeff Liu	762d7ba657	xfs: calculate XFS_TRANS_QM_QUOTAOFF_END space log reservation at mount time Convert the calculation for end of quotaoff log space reservation from runtime to mount time. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:45:50 -06:00
Jeff Liu	a1bd955754	xfs: calculate XFS_TRANS_QM_QUOTAOFF space log reservation at mount time Convert the calculation of quota off transaction log space reservation from runtime to mount time. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:44:29 -06:00
Jeff Liu	4800104438	xfs: calculate XFS_TRANS_QM_DQALLOC space log reservation at mount time The disk quota allocation log space reservation is calculated at runtime; this patch does it at mount time. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:43:51 -06:00
Jeff Liu	f0f2df94fa	xfs: calculate XFS_TRANS_QM_SETQLIM space log reservation at mount time For transactions that adjust quota limits, we calculate the log space reservation at runtime; this patch does it at mount time. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:43:11 -06:00
Jeff Liu	f910a8c620	xfs: calculate xfs_qm_write_sb_changes() space log reservation at mount time The transaction that writes the incore superblock changes of quota flags to disk reserves the same log space as the clear/reset quota flags transaction, hence we can use XFS_TRANS_SBCHANGE_LOG_RES() for it as well. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:42:32 -06:00
Jeff Liu	b0c10b983a	xfs: calculate XFS_TRANS_QM_SBCHANGE space log reservation at mount time The transaction log space for clearing/resetting the quota flags is calculated at runtime; this patch figures it out at mount time. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:40:17 -06:00
Jeff Liu	5b292ae3a9	xfs: make use of xfs_calc_buf_res() in xfs_trans.c Refining the existing reservations with xfs_calc_buf_res() in xfs_trans.c Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:39:29 -06:00
Bob Peterson	d2b47cfb26	GFS2: Get a block reservation before resizing a file This patch allocates a block reservation structure before growing or shrinking a file. Without this structure, the grow or shrink code can reference a bad pointer. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2013-02-01 20:37:33 +00:00
Steven Whitehouse	4506a519f2	GFS2: Split glock lru processing into two parts The intent here is to split the processing of the glock lru list into two parts, so that the selection of glocks and the disposal are separate functions. The plan is then, that further updates can then be made to these functions in the future to improve the selection of glocks and also the efficiency of glock disposal. The new feature which this patch brings is sorting the glocks to be disposed of into glock number (and thus also disk block number) order. Not all glocks will need i/o in order to dispose of them, but some will, and at least we'll generate mostly disk block order i/o now. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2013-02-01 20:36:03 +00:00
Jeff Liu	4f3b57832b	xfs: add a helper to figure out the space log reservation per item Add a new helper xfs_calc_buf_res() to calculate the transaction space reservations per item. xfs_buf_log_overhead() is used to figure out the extra space for struct xfs_buf_log_format that gets written into the log for every buffer as well as a log opheader, i.e. struct xlog_op_header. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-02-01 14:35:06 -06:00
Chris Mason	bb721703aa	Btrfs: reduce CPU contention while waiting for delayed extent operations We batch up operations to the extent allocation tree, which allows us to deal with the recursive nature of using the extent allocation tree to allocate extents to the extent allocation tree. It also provides a mechanism to sort and collect extent operations, which makes it much more efficient to record extents that are close together. The delayed extent operations must all be finished before the running transaction commits, so we have code to make sure and run a few of the batched operations when closing our transaction handles. This creates a great deal of contention for the locks in the delayed extent operation tree, and also contention for the lock on the extent allocation tree itself. All the extra contention just slows down the operations and doesn't get things done any faster. This commit changes things to use a wait queue instead. As procs want to run the delayed operations, one of them races in and gets permission to hit the tree, and the others step back and wait for progress to be made. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:25 -05:00
Chris Mason	242e18c7c1	Btrfs: reduce lock contention on extent buffer locks The extent buffers have a refs_lock which we use to coordinate freeing the extent buffer with operations on the radix tree. On tree roots and other extent buffers that are very cache hot, this can be highly contended. These are also the extent buffers that are basically pinned in memory. This commit adds code to cmpxchg our way through the ref modifications, and as long as the result of the reference change is still pinned in ram, we skip the expensive spinlock. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:25 -05:00
Chris Mason	8de972b4fa	Btrfs: fix cluster alignment for mount -o ssd With the new raid56 code, we want to make sure we're properly aligning our allocation clusters with -o ssd Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:24 -05:00
Chris Mason	6ac0f4884e	Btrfs: add a plugging callback to raid56 writes Buffered writes and DIRECT_IO writes will often break up big contiguous changes to the file into sub-stripe writes. This adds a plugging callback to gather those smaller writes into full stripe writes. Example on flash: fio job to do 64K writes in batches of 3 (which makes a full stripe): With plugging: 450MB/s Without plugging: 220MB/s Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:24 -05:00
Chris Mason	4ae10b3a13	Btrfs: Add a stripe cache to raid56 The stripe cache allows us to avoid extra read/modify/write cycles by caching the pages we read off the disk. Pages are cached when: * They are read in during a read/modify/write cycle * They are written during a read/modify/write cycle * They are involved in a parity rebuild Pages are not cached if we're doing a full stripe write. We're assuming that a full stripe write won't be followed by another partial stripe write any time soon. This provides a substantial boost in performance for workloads that synchronously modify adjacent offsets in the file, and for the parity rebuild use case in general. The size of the stripe cache isn't tunable (yet) and is set at 1024 entries. Example on flash: dd if=/dev/zero of=/mnt/xxx bs=4K oflag=direct Without the stripe cache -- 2.1MB/s With the stripe cache 21MB/s Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:23 -05:00
David Woodhouse	53b381b3ab	Btrfs: RAID5 and RAID6 This builds on David Woodhouse's original Btrfs raid5/6 implementation. The code has changed quite a bit, blame Chris Mason for any bugs. Read/modify/write is done after the higher levels of the filesystem have prepared a given bio. This means the higher layers are not responsible for building full stripes, and they don't need to query for the topology of the extents that may get allocated during delayed allocation runs. It also means different files can easily share the same stripe. But, it does expose us to incorrect parity if we crash or lose power while doing a read/modify/write cycle. This will be addressed in a later commit. Scrub is unable to repair crc errors on raid5/6 chunks. Discard does not work on raid5/6 (yet) The stripe size is fixed at 64KiB per disk. This will be tunable in a later commit. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:23 -05:00
David Woodhouse	64a167011b	Btrfs: add rw argument to merge_bio_hook() We'll want to merge writes so they can fill a full RAID[56] stripe, but not necessarily reads. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 11:49:47 -05:00
Eric Sandeen	3c91160808	btrfs: don't try to notify udev about missing devices If we remove a missing device, bdev is null, and if we send that off to btrfs_kobject_uevent we'll panic. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 11:47:37 -05:00
Trond Myklebust	322b2b9032	Revert "NFS: add nfs_sb_deactive_async to avoid deadlock" This reverts commit `324d003b0c`. The deadlock turned out to be caused by a workqueue limitation that has now been worked around in the RPC code (see comment in rpc_free_task). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-02-01 10:13:48 -05:00
Linus Torvalds	bf6c8a8148	NFS client bugfixes for Linux 3.8 Merge tag 'nfs-for-3.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client bugfixes from Trond Myklebust: - Error reporting in nfs_xdev_mount incorrectly maps all errors to ENOMEM - Fix an NFSv4 refcounting issue - Fix a mount failure when the server reboots during NFSv4 trunking discovery - NFSv4.1 mounts may need to run the lease recovery thread. - Don't silently fail setattr() requests on mountpoints - Fix a SUNRPC socket/transport livelock and priority queue issue - We must handle NFS4ERR_DELAY when resetting the NFSv4.1 session. 
* tag 'nfs-for-3.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4.1: Handle NFS4ERR_DELAY when resetting the NFSv4.1 session SUNRPC: When changing the queue priority, ensure that we change the owner NFS: Don't silently fail setattr() requests on mountpoints NFSv4.1: Ensure that nfs41_walk_client_list() does start lease recovery NFSv4: Fix NFSv4 trunking discovery NFSv4: Fix NFSv4 reference counting for trunked sessions NFS: Fix error reporting in nfs_xdev_mount	2013-02-01 08:43:52 +11:00
Feng Shuo	4582a4ab2a	FUSE: Adapt readdirplus to application usage patterns Use the same adaptive readdirplus mechanism as NFS: http://permalink.gmane.org/gmane.linux.nfs/49299 If the user space implementation wants to disable readdirplus temporarily, it could just return ENOTSUPP. Then kernel will recall it with readdir. Signed-off-by: Feng Shuo <steve.shuo.feng@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-31 17:08:11 +01:00
Anatol Pomozov	c2132c1bc7	Do not use RCU for current process credentials Commit `c69e8d9c0` added an RCU lock to fuse/dir.c. It assumed that 'task' is some other process, but in fact this parameter is always equal to 'current'. Inline this parameter to make the code more readable and remove the RCU lock, as it is not needed when accessing the current process's credentials. Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-31 17:08:10 +01:00
Trond Myklebust	c489ee290b	NFSv4.1: Handle NFS4ERR_DELAY when resetting the NFSv4.1 session NFS4ERR_DELAY is a legal reply when we call DESTROY_SESSION. It usually means that the server is busy handling an unfinished RPC request. Just sleep for a second and then retry. We also need to be able to handle the NFS4ERR_BACK_CHAN_BUSY return value. If the NFS server has outstanding callbacks, we just want to similarly sleep & retry. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org	2013-01-30 17:45:15 -05:00
Trond Myklebust	ab22541782	NFS: Don't silently fail setattr() requests on mountpoints Ensure that any setattr and getattr requests for junctions and/or mountpoints are sent to the server. Ever since commit `0ec26fd069` (vfs: automount should ignore LOOKUP_FOLLOW), we have silently dropped any setattr requests to a server-side mountpoint. For referrals, we have silently dropped both getattr and setattr requests. This patch restores the original behaviour for setattr on mountpoints, and tries to do the same for referrals, provided that we have a filehandle... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org	2013-01-30 17:41:04 -05:00
Alex Elder	969e5aa3b0	Merge branch 'testing' of github.com:ceph/ceph-client into v3.8-rc5-testing	2013-01-30 07:54:34 -06:00
Eric Sandeen	e7b04ac00e	jbd2: don't wake kjournald unnecessarily Don't send an extra wakeup to kjournald in the case where we already have the proper target in j_commit_request, i.e. that transaction has already been requested for commit. commit `deeeaf13` "jbd2: fix fsync() tid wraparound bug" changed the logic leading to a wakeup, but it caused some extra wakeups which were found to lead to a measurable performance regression. Signed-off-by: Eric Sandeen <sandeen@redhat.com> [tytso@mit.edu: reworked check to make it clearer] Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-01-30 00:39:28 -05:00
Jan Kara	091e26dfc1	ext4: fix possible use-after-free with AIO Running AIO is pinning inode in memory using file reference. Once AIO is completed using aio_complete(), file reference is put and inode can be freed from memory. So we have to be sure that calling aio_complete() is the last thing we do with the inode. CC: stable@vger.kernel.org Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-01-29 22:48:17 -05:00
Linus Torvalds	f96736e1ba	xfs: bugfixes for 3.8-rc6 Merge tag 'for-linus-v3.8-rc6' of git://oss.sgi.com/xfs/xfs Pull xfs bugfixes from Ben Myers: "Here are fixes for returning EFSCORRUPTED on probe of a non-xfs filesystem, the stack switch in xfs_bmapi_allocate, a crash in _xfs_buf_find, speculative preallocation as the filesystem nears ENOSPC, an unmount hang, a race with AIO, and a regression with xfs_fsr: - fix return value when filesystem probe finds no XFS magic, a regression introduced in `9802182`. 
- fix stack switch in __xfs_bmapi_allocate by moving the check for stack switch up into xfs_bmapi_write. - fix oops in _xfs_buf_find by validating that the requested block is within the filesystem bounds. - limit speculative preallocation near ENOSPC. - fix an unmount hang in xfs_wait_buftarg by freeing the xfs_buf_log_item in xfs_buf_item_unlock. - fix a possible use after free with AIO. - fix xfs_swap_extents after removal of xfs_flushinval_pages, a regression introduced in commit fb59581404a." * tag 'for-linus-v3.8-rc6' of git://oss.sgi.com/xfs/xfs: xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages() xfs: Fix possible use-after-free with AIO xfs: fix shutdown hang on invalid inode during create xfs: limit speculative prealloc near ENOSPC thresholds xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end xfs: pull up stack_switch check into xfs_bmapi_write xfs: Do not return EFSCORRUPTED when filesystem probe finds no XFS magic	2013-01-30 11:59:37 +11:00
majianpeng	885c91f746	nfsd: Fix memleak in svc_export_put In svc_export_parse, the uuid allocated with kmemdup is changed in export_update, so the later kfree does not free this memory. It cannot be freed in svc_export_parse because it is still used elsewhere, so move the free into svc_export_put. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-01-29 16:50:03 -05:00
Steven Whitehouse	4513899092	GFS2: Use ->writepages for ordered writes Instead of using a list of buffers to write ahead of the journal flush, this now uses a list of inodes and calls ->writepages via filemap_fdatawrite() in order to achieve the same thing. For most use cases this results in a shorter ordered write list, as well as much larger i/os being issued. The ordered write list is sorted by inode number before writing in order to retain the disk block ordering between inodes as per the previous code. The previous ordered write code used to conflict in its assumptions about how to write out the disk blocks with mpage_writepages() so that with this updated version we can also use mpage_writepages() for GFS2's ordered write, writepages implementation. So we will also send larger i/os from writeback too. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2013-01-29 10:29:17 +00:00
Steven Whitehouse	d564053f07	GFS2: Clean up freeze code The freeze code has not been looked at a lot recently. Upstream has moved on, and this is an attempt to catch us back up again. There is a vfs level interface for the freeze code which can be called from our (obsolete, but kept for backward compatibility purposes) sysfs freeze interface. This means freezing this way vs. doing it from the ioctl should now work in identical fashion. As a result of this, the freeze function is only called once and we can drop our own special purpose code for counting the number of freezes. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2013-01-29 10:29:05 +00:00
Steven Whitehouse	c76c4d96bd	GFS2: Merge gfs2_attach_bufdata() into trans.c The locking in gfs2_attach_bufdata() was type specific (data/meta) which made the function rather confusing. This patch moves the core of gfs2_attach_bufdata() into trans.c renaming it gfs2_alloc_bufdata() and moving the locking into gfs2_trans_add_data()/gfs2_trans_add_meta() As a result all of the locking related to adding data and metadata to the journal is now in these two functions. This should help to clarify what is going on, and give us some opportunities to simplify in some cases. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2013-01-29 10:28:44 +00:00
Steven Whitehouse	767f433f34	GFS2: Copy gfs2_trans_add_bh into new data/meta functions This patch copies the body of gfs2_trans_add_bh into the two newly added gfs2_trans_add_data and gfs2_trans_add_meta functions. We can then move the .lo_add functions from lops.c into trans.c and call them directly. As a result of this, we no longer need to use the .lo_add functions at all, so that is removed from the log operations structure. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2013-01-29 10:28:28 +00:00
Steven Whitehouse	350a9b0a72	GFS2: Split gfs2_trans_add_bh() into two There is little common content in gfs2_trans_add_bh() between the data and meta classes by the time that the functions which it calls are taken into account. The intent here is to split this into two separate functions. Stage one is to introduce gfs2_trans_add_data() and gfs2_trans_add_meta() and update the callers accordingly. Later patches will then pull in the content of gfs2_trans_add_bh() and its dependent functions in order to clean up the code in this area. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2013-01-29 10:28:04 +00:00
Steven Whitehouse	75f2b879ae	GFS2: Merge revoke adding functions This moves the lo_add function for revokes into trans.c, removing a function call and making the code easier to read. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2013-01-29 10:27:46 +00:00
Steven Whitehouse	2a00585593	GFS2: Separate LRU scanning from shrinker This breaks out the LRU scanning function from the shrinker in preparation for adding other callers to the LRU scanner. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2013-01-29 10:27:28 +00:00
Jiri Kosina	617677295b	Merge branch 'master' into for-next Conflicts: drivers/devfreq/exynos4_bus.c Sync with Linus' tree to be able to apply patches that are against newer code (mvneta).	2013-01-29 10:48:30 +01:00
Guo Chao	b1deefc99e	ext4: remove unnecessary NULL pointer check brelse() and ext4_journal_force_commit() are both inlined and able to handle NULL. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-01-28 21:41:02 -05:00
Guo Chao	41be871f74	ext4: remove useless assignment in dx_probe() Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-01-28 21:33:28 -05:00
Guo Chao	2bbbee2a68	ext4: remove unused variable in add_dirent_to_buf() After commit `978fef9` (create __ext4_insert_dentry for dir entry insertion), 'reclen' is not used anymore. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>	2013-01-28 21:26:44 -05:00
Guo Chao	d5ac777305	ext4: release buffer when checksum failed Commit `b0336e8d` (ext4: calculate and verify checksums of directory leaf blocks) and commit `dbe89444` (ext4: Calculate and verify checksums for htree nodes) forget to release buffer when checksum failed, at some places. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>	2013-01-28 21:23:24 -05:00
Lukas Czerner	b06acd38a4	ext4: remove explicit WARN_ON when ext4_map_blocks() fails In two places we call WARN_ON() before we print out the debug message, however we agreed that the WARN_ON() is unnecessary at those places so remove them. Also use ext4_warning() instead of ext4_msg() and printk(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-01-28 21:21:12 -05:00
Lukas Czerner	cfa7275482	ext4: remove unused variable flags Remove unused variable flags from dump_completed_IO(). The code is only exercised when EXT4FS_DEBUG is defined. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>	2013-01-28 21:14:11 -05:00
Jan Kara	fe386132f6	ext4: fix ext4_writepage() to achieve data=ordered guarantees So far ext4_writepage() skipped writing pages that had any delayed or unwritten buffers attached. When blocksize < pagesize this breaks data=ordered mode guarantees as we can have a page with one freshly allocated buffer whose allocation is part of the committing transaction and another buffer in the page which is delayed or unwritten. So fix this problem by calling ext4_bio_writepage() anyway. It will submit mapped buffers and leave others alone. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-01-28 21:06:42 -05:00
Jan Kara	8a850c3fb8	ext4: Make ext4_bio_writepage() handle unprepared buffers So far ext4_bio_writepage() unconditionally cleared the dirty bit on all buffers underlying the page. That implicitly assumes we can write all buffers. So far that is true because callers of ext4_bio_writepage() make sure all buffers in the page are mapped but: a) it's a data corruption bug waiting to happen b) in data=ordered mode when blocksize < pagesize we do need to write pages that may have only some of their dirty buffers mapped. So change ext4_bio_writepage() to skip buffers that cannot be written, without clearing their dirty bit. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-01-28 20:53:28 -05:00
Torsten Kaiser	65e3aa77f1	xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages() Commit `fb59581404` removed xfs_flushinval_pages() and changed its callers to use filemap_write_and_wait() and truncate_pagecache_range() directly. But in xfs_swap_extents() this change accidentally switched the argument for 'tip' to 'ip'. This patch switches it back to 'tip'. Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-28 16:05:10 -06:00
Torsten Kaiser	2729423cf2	xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages() Commit `fb59581404` removed xfs_flushinval_pages() and changed its callers to use filemap_write_and_wait() and truncate_pagecache_range() directly. But in xfs_swap_extents() this change accidentally switched the argument for 'tip' to 'ip'. This patch switches it back to 'tip'. Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-28 13:50:10 -06:00
Jan Kara	4b05d09c18	xfs: Fix possible use-after-free with AIO Running AIO is pinning inode in memory using file reference. Once AIO is completed using aio_complete(), file reference is put and inode can be freed from memory. So we have to be sure that calling aio_complete() is the last thing we do with the inode. CC: xfs@oss.sgi.com CC: Ben Myers <bpm@sgi.com> CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-28 12:51:22 -06:00
Dave Chinner	9f87832a82	xfs: fix shutdown hang on invalid inode during create When the new inode verify in xfs_iread() fails, the create transaction is aborted and a shutdown occurs. The subsequent unmount then hangs in xfs_wait_buftarg() on a buffer that has an elevated hold count. Debug showed that it was an AGI buffer getting stuck: [ 22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck [ 22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck [ 23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck [ 23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck The trace of this buffer leading up to the shutdown (trimmed for brevity) looks like: xfs_buf_init: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map xfs_buf_get: bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map xfs_buf_read: bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map xfs_buf_iorequest: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest xfs_buf_iowait: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read xfs_buf_ioerror: bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io xfs_buf_iodone: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init xfs_trans_read_buf: bno 0x2 len 0x200 hold 2 recur 0 refcount 1 xfs_trans_brelse: bno 0x2 len 0x200 hold 2 recur 0 refcount 1 xfs_buf_item_relse: bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse xfs_buf_unlock: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse xfs_buf_rele: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse xfs_buf_trylock: bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find xfs_buf_find: bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map xfs_buf_get: bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map xfs_buf_read: bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map 
xfs_buf_hold: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init xfs_trans_read_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1 xfs_trans_log_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1 xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED xfs_buf_unlock: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock xfs_buf_rele: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock And that is the AGI buffer from cold cache read into memory to transaction abort. You can see at transaction abort the bli is dirty and only has a single reference. The item is not pinned, and it's not in the AIL. Hence the only reference to it is this transaction. The problem is that the xfs_buf_item_unlock() call is dropping the last reference to the xfs_buf_log_item attached to the buffer (which holds a reference to the buffer), but it is not freeing the xfs_buf_log_item. Hence nothing will ever release the buffer, and the unmount hangs waiting for this reference to go away. The fix is simple - xfs_buf_item_unlock needs to detect the last reference going away in this case and free the xfs_buf_log_item to release the reference it holds on the buffer. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-28 12:51:12 -06:00
Dave Chinner	f2a459565b	xfs: limit speculative prealloc near ENOSPC thresholds There is a window on small filesystems where speculative preallocation can be larger than the ENOSPC throttling thresholds, resulting in speculative preallocation trying to reserve more space than there is space available. This causes immediate ENOSPC to be triggered, prealloc to be turned off and flushing to occur. On the next write (i.e. next 4k page), we do exactly the same thing, and so effectively drive into synchronous 4k writes by triggering ENOSPC flushing on every page while in the window between the prealloc size and the ENOSPC prealloc throttle threshold. Fix this by checking to see if the prealloc size would consume all free space, and throttle it appropriately to avoid premature ENOSPC... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-28 12:50:50 -06:00
Dave Chinner	eb178619f9	xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end When _xfs_buf_find is passed an out of range address, it will fail to find a relevant struct xfs_perag and oops with a null dereference. This can happen when trying to walk a filesystem with a metadata inode that has a partially corrupted extent map (i.e. the block number returned is corrupt, but is otherwise intact) and we try to read from the corrupted block address. In this case, just fail the lookup. If it is readahead being issued, it will simply not be done, but if it is a real read that fails we will get an error being reported. Ideally this case should result in an EFSCORRUPTED error being reported, but we cannot return an error through xfs_buf_read() or xfs_buf_get() so this lookup failure may result in ENOMEM or EIO errors being reported instead. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-28 12:49:21 -06:00
Brian Foster	d26978dd86	xfs: pull up stack_switch check into xfs_bmapi_write The stack_switch check currently occurs in __xfs_bmapi_allocate, which means the stack switch only occurs when xfs_bmapi_allocate() is called in a loop. Pull the check up before the loop in xfs_bmapi_write() such that the first iteration of the loop has consistent behavior. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-28 12:48:55 -06:00
Eric Sandeen	1bee12b8c4	xfs: Do not return EFSCORRUPTED when filesystem probe finds no XFS magic `9802182` changed the return value from EWRONGFS (aka EINVAL) to EFSCORRUPTED which doesn't seem to be handled properly by the root filesystem probe. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Tested-by: Sergei Trofimovich <slyfox@gentoo.org> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-28 12:48:21 -06:00
Jan Kara	b6a8e62f8b	ext4: simplify mpage_add_bh_to_extent() The argument b_size of mpage_add_bh_to_extent() was bogus since it was always == blocksize (which we can easily derive from inode->i_blkbits). Also second branch of condition: if (nrblocks >= EXT4_MAX_TRANS_DATA) { } else if ((nrblocks + (b_size >> mpd->inode->i_blkbits)) > EXT4_MAX_TRANS_DATA) { } was never taken because (b_size >> mpd->inode->i_blkbits) == 1. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-01-28 13:06:48 -05:00
Jan Kara	f8bec37037	ext4: dirty page has always buffers attached ext4_writepage(), write_cache_pages_da(), and mpage_da_submit_io() don't have to deal with the case when a page doesn't have buffers. We attach buffers to a page in ->write_begin() and ->page_mkwrite() which covers all places where a page can become dirty. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-01-28 12:55:08 -05:00
Jan Kara	002bd7fa3a	ext4: simplify list handling in ext4_do_flush_completed_IO() The function splices i_completed_io_list to its private list first. From that moment on we don't need any lock for working with io_end structures because all io_end structure on the list are only our own. So we can remove the other two lists in the function and free io_end immediately after we are done with it. CC: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-01-28 09:49:15 -05:00
Jan Kara	84c17543ab	ext4: move work from io_end to inode It does not make much sense to have struct work in ext4_io_end_t because we always use it for only one ext4_io_end_t per inode (the first one in the i_completed_io list). So just move the structure to inode itself. This also allows for a small simplification in processing io_end structures. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-01-28 09:43:46 -05:00
Jan Kara	fe089c77f1	ext4: remove __ext4_journalled_writepage() from mpage_da_submit_io() We don't support delayed allocation in data=journal mode. So checking for it in mpage_da_submit_io() doesn't really make sense. If we ever decide to extend delayed allocation support to data=journal mode, adding __ext4_journalled_writepage() call will be the least of problems we have to solve. Most likely we'd have to implement separate writepages call anyways because we don't have transaction credits for writing more than a single page so mapping of page buffers would have to be done differently. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-01-28 09:38:49 -05:00
Jan Kara	1ae48a6354	ext4: use redirty_page_for_writepage() in ext4_bio_write_page() When we cannot write a page we should use redirty_page_for_writepage() instead of plain set_page_dirty(). That tells writeback code we have problems, redirties only the page (redirtying buffers is not needed), and updates mm accounting of failed page writes. Also move clearing of buffer dirty flag after io_submit_add_bh(). At that moment we are sure buffer will be going to disk. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-01-28 09:32:54 -05:00
Jan Kara	36ade451a5	ext4: Always use ext4_bio_write_page() for writeout Currently we sometimes used block_write_full_page() and sometimes ext4_bio_write_page() for writeback (depending on mount options and call path). Let's always use ext4_bio_write_page() to simplify things a bit. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-01-28 09:30:52 -05:00
Zheng Liu	8bad6fc813	ext4: add punching hole support for non-extent-mapped files This patch adds punching hole support for indirect-mapped files. It is almost the same as ext4_ext_punch_hole. First, we invalidate all pages between this hole, and then we try to deallocate all blocks of this hole. A recursive function is used to handle deallocation of blocks. It iterates over the entries in inode's i_blocks or indirect blocks, and tries to free the block for each one of them. After applying this patch, xfstest #255 will not pass w/o extent because indirect-based file doesn't support unwritten extents. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-01-28 09:21:37 -05:00
David Teigland	d4e0bfec9b	GFS2: fix skip unlock condition The recent commit `fb6791d100` included the wrong logic. The lvbptr check was incorrectly added after the patch was tested. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2013-01-28 09:49:15 +00:00
Trond Myklebust	65436ec0c8	NFSv4.1: Ensure that nfs41_walk_client_list() does start lease recovery We do need to start the lease recovery thread prior to waiting for the client initialisation to complete in NFSv4.1. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Ben Greear <greearb@candelatech.com> Cc: stable@vger.kernel.org [>=3.7]	2013-01-27 15:51:41 -05:00
Trond Myklebust	202c312dba	NFSv4: Fix NFSv4 trunking discovery If walking the list in nfs4[01]_walk_client_list fails, then the most likely explanation is that the server dropped the clientid before we actually managed to confirm it. As long as our nfs_client is the very last one in the list to be tested, the caller can be assured that this is the case when the final return value is NFS4ERR_STALE_CLIENTID. Reported-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org [>=3.7] Tested-by: Ben Greear <greearb@candelatech.com>	2013-01-27 15:51:28 -05:00
Trond Myklebust	4ae19c2dd7	NFSv4: Fix NFSv4 reference counting for trunked sessions The reference counting in nfs4_init_client assumes wrongly that it is safe for nfs4_discover_server_trunking() to return a pointer to a nfs_client prior to bumping the reference count. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Ben Greear <greearb@candelatech.com> Cc: stable@vger.kernel.org [>=3.7]	2013-01-27 15:51:15 -05:00
Trond Myklebust	dee972b967	NFS: Fix error reporting in nfs_xdev_mount Currently, nfs_xdev_mount converts all errors from clone_server() to ENOMEM, which can then leak to userspace (for instance to 'mount'). Fix that. Also ensure that if nfs_fs_mount_common() returns an error, we don't dprintk(0)... The regression originated in commit `3d176e3fe4` (NFS: Use nfs_fs_mount_common() for xdev mounts) Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org [>= 3.5]	2013-01-27 15:51:15 -05:00
Frederic Weisbecker	6fac4829ce	cputime: Use accessors to read task cputime stats This is in preparation for the full dynticks feature. While remotely reading the cputime of a task running in a full dynticks CPU, we'll need to do some extra-computation. This way we can account the time it spent tickless in userspace since its last cputime snapshot. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>	2013-01-27 19:23:31 +01:00
Eric W. Biederman	b3c6761d9b	userns: Allow the userns root to mount ramfs. There is no backing store to ramfs and file creation rules are the same as for any other filesystem so it is semantically safe to allow unprivileged users to mount it. The memory control group successfully limits how much memory ramfs can consume. On any system that cares about a user namespace root using ramfs to exhaust memory, the memory control group can be deployed. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-01-26 22:22:38 -08:00
Eric W. Biederman	ec2aa8e8dd	userns: Allow the userns root to mount of devpts - The context in which devpts is mounted has no effect on the creation of ptys as the /dev/ptmx interface has been used by unprivileged users for many years. - Only support unprivileged mounts in combination with the newinstance option to ensure that mounting of /dev/pts in a user namespace will not allow the options of an existing mount of devpts to be modified. - Create /dev/pts/ptmx as the root user in the user namespace that mounts devpts so that its permissions can be changed. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-01-26 22:22:21 -08:00
Jan Kara	ced55f38d6	xfs: Fix possible use-after-free with AIO Running AIO is pinning inode in memory using file reference. Once AIO is completed using aio_complete(), file reference is put and inode can be freed from memory. So we have to be sure that calling aio_complete() is the last thing we do with the inode. CC: xfs@oss.sgi.com CC: Ben Myers <bpm@sgi.com> CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-26 09:43:58 -06:00
Dave Chinner	3b19034d4f	xfs: fix shutdown hang on invalid inode during create When the new inode verify in xfs_iread() fails, the create transaction is aborted and a shutdown occurs. The subsequent unmount then hangs in xfs_wait_buftarg() on a buffer that has an elevated hold count. Debug showed that it was an AGI buffer getting stuck: [ 22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck [ 22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck [ 23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck [ 23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck The trace of this buffer leading up to the shutdown (trimmed for brevity) looks like: xfs_buf_init: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map xfs_buf_get: bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map xfs_buf_read: bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map xfs_buf_iorequest: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest xfs_buf_iowait: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read xfs_buf_ioerror: bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io xfs_buf_iodone: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init xfs_trans_read_buf: bno 0x2 len 0x200 hold 2 recur 0 refcount 1 xfs_trans_brelse: bno 0x2 len 0x200 hold 2 recur 0 refcount 1 xfs_buf_item_relse: bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse xfs_buf_unlock: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse xfs_buf_rele: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse xfs_buf_trylock: bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find xfs_buf_find: bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map xfs_buf_get: bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map xfs_buf_read: bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map xfs_buf_hold: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init xfs_trans_read_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1 xfs_trans_log_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1 xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED xfs_buf_unlock: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock xfs_buf_rele: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock And that is the AGI buffer from cold cache read into memory to transaction abort. You can see at transaction abort the bli is dirty and only has a single reference. The item is not pinned, and it's not in the AIL. Hence the only reference to it is this transaction. The problem is that the xfs_buf_item_unlock() call is dropping the last reference to the xfs_buf_log_item attached to the buffer (which holds a reference to the buffer), but it is not freeing the xfs_buf_log_item. Hence nothing will ever release the buffer, and the unmount hangs waiting for this reference to go away. The fix is simple - xfs_buf_item_unlock needs to detect the last reference going away in this case and free the xfs_buf_log_item to release the reference it holds on the buffer. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-26 09:34:38 -06:00
Greg Kroah-Hartman	422d26b6ec	Merge 3.8-rc5 into driver-core-next This resolves a gpio driver merge issue pointed out in linux-next. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-25 21:06:30 -08:00
Greg Kroah-Hartman	9f9cba810f	Merge 3.8-rc5 into tty-next This resolves a number of tty driver merge issues found in linux-next Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-25 13:27:36 -08:00
Rafael J. Wysocki	0bb8f3d6ae	sysfs: Functions for adding/removing symlinks to/from attribute groups The most convenient way to expose ACPI power resources lists of a device is to put symbolic links to sysfs directories representing those resources into special attribute groups in the device's sysfs directory. For this purpose, it is necessary to be able to add symbolic links to attribute groups. For this reason, add sysfs helper functions for adding/removing symbolic links to/from attribute groups, sysfs_add_link_to_group() and sysfs_remove_link_from_group(), respectively. This change set includes a build fix from David Rientjes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-25 21:51:13 +01:00
Linus Torvalds	d7df025eb4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "It turns out that we had two crc bugs when running fsx-linux in a loop. Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it all down. Miao also has a new OOM fix in this v2 pull as well. Ilya fixed a regression Liu Bo found in the balance ioctls for pausing and resuming a running balance across drives. Josef's orphan truncate patch fixes an obscure corruption we'd see during xfstests. Arne's patches address problems with subvolume quotas. If the user destroys quota groups incorrectly the FS will refuse to mount. The rest are smaller fixes and plugs for memory leaks." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (30 commits) Btrfs: fix repeated delalloc work allocation Btrfs: fix wrong max device number for single profile Btrfs: fix missed transaction->aborted check Btrfs: Add ACCESS_ONCE() to transaction->abort accesses Btrfs: put csums on the right ordered extent Btrfs: use right range to find checksum for compressed extents Btrfs: fix panic when recovering tree log Btrfs: do not allow logged extents to be merged or removed Btrfs: fix a regression in balance usage filter Btrfs: prevent qgroup destroy when there are still relations Btrfs: ignore orphan qgroup relations Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag Btrfs: fix unlock order in btrfs_ioctl_rm_dev Btrfs: fix unlock order in btrfs_ioctl_resize Btrfs: fix "mutually exclusive op is running" error code Btrfs: bring back balance pause/resume logic btrfs: update timestamps on truncate() btrfs: fix btrfs_cont_expand() freeing IS_ERR em Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents Btrfs: fix off-by-one in lseek ...	2013-01-25 10:55:21 -08:00
Chen Gang	03dafb5f59	ext4: fix memory leak when quota options are specified multiple times When usrjquota or grpjquota mount options are specified several times, we leak memory storing the names. Free the memory correctly. Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>	2013-01-24 23:24:58 -05:00
Theodore Ts'o	72ba74508b	ext4: release sysfs kobject when failing to enable quotas on mount In addition, print the error returned from ext4_enable_quotas() Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Cc: stable@vger.kernel.org	2013-01-24 23:24:54 -05:00
Linus Torvalds	66e2d3e8c2	Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: "Two small cifs fixes" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: fs/cifs/cifs_dfs_ref.c: fix potential memory leakage cifs: fix srcip_matches() for ipv6	2013-01-24 19:15:43 -08:00
Miao Xie	1eafa6c737	Btrfs: fix repeated delalloc work allocation btrfs_start_delalloc_inodes() locks the delalloc_inodes list, fetches the first inode, unlocks the list, triggers btrfs_alloc_delalloc_work/ btrfs_queue_worker for this inode, and then it locks the list, checks the head of the list again. But because we don't delete the first inode that it deals with before, it will fetch the same inode. As a result, this function allocates a huge amount of btrfs_delalloc_work structures, and OOM happens. Fix this problem by splice this delalloc list. Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:51:27 -05:00
Miao Xie	c9f01bfe0c	Btrfs: fix wrong max device number for single profile The max device number of single profile is 1, not 0 (0 means 'as many as possible'). Fix it. Cc: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:51:26 -05:00
Miao Xie	2cba30f172	Btrfs: fix missed transaction->aborted check First, though the current transaction->aborted check can stop the commit early and avoid unnecessary operations, it is too early, and some transaction handles don't end, those handles may set transaction->aborted after the check. Second, when we commit the transaction, we will wake up some worker threads to flush the space cache and inode cache. Those threads also allocate some transaction handles and may set transaction->aborted if some serious error happens. So we need more check for ->aborted when committing the transaction. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:51:25 -05:00
Miao Xie	8d25a086eb	Btrfs: Add ACCESS_ONCE() to transaction->abort accesses We may access and update transaction->aborted on the different CPUs without lock, so we need ACCESS_ONCE() wrapper to prevent the compiler from creating unsolicited accesses and make sure we can get the right value. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:51:23 -05:00
Josef Bacik	e58dd74bcc	Btrfs: put csums on the right ordered extent I noticed a WARN_ON going off when adding csums because we were going over the amount of csum bytes that should have been allowed for an ordered extent. This is a leftover from when we used to hold the csums privately for direct io, but now we use the normal ordered sum stuff so we need to make sure and check if we've moved on to another extent so that the csums are added to the right extent. Without this we could end up with csums for bytenrs that don't have extents to cover them yet. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:51:22 -05:00
Liu Bo	192000dda2	Btrfs: use right range to find checksum for compressed extents For compressed extents, the range of checksum is covered by disk length, and the disk length is different with ram length, so we need to use disk length instead to get us the right checksum. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:51:17 -05:00
Josef Bacik	b0175117b9	Btrfs: fix panic when recovering tree log A user reported a BUG_ON(ret) that occurred during tree log replay. Ret was -EAGAIN, so what I think happened is that we removed an extent that covered a bitmap entry and an extent entry. We remove the part from the bitmap and return -EAGAIN and then search for the next piece we want to remove, which happens to be an entire extent entry, so we just free the sucker and return. The problem is ret is still set to -EAGAIN so we trip the BUG_ON(). The user used btrfs-zero-log so I'm not 100% sure this is what happened so I've added a WARN_ON() to catch the other possibility. Thanks, Reported-by: Jan Steffens <jan.steffens@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:49:49 -05:00
Josef Bacik	201a903894	Btrfs: do not allow logged extents to be merged or removed We drop the extent map tree lock while we're logging extents, so somebody could come in and merge another extent into this one and screw up our logging, or they could even remove us from the list which would keep us from logging the extent or freeing our ref on it, so we need to make sure to not clear LOGGING until after the extent is logged, and then we can merge it to adjacent extents. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:49:48 -05:00
Dave Chinner	4d559a3bcb	xfs: limit speculative prealloc near ENOSPC thresholds There is a window on small filesystems where speculative preallocation can be larger than the ENOSPC throttling thresholds, resulting in speculative preallocation trying to reserve more space than there is space available. This causes immediate ENOSPC to be triggered, prealloc to be turned off and flushing to occur. On the next write (i.e. next 4k page), we do exactly the same thing, and so effectively drive into synchronous 4k writes by triggering ENOSPC flushing on every page while in the window between the prealloc size and the ENOSPC prealloc throttle threshold. Fix this by checking to see if the prealloc size would consume all free space, and throttle it appropriately to avoid premature ENOSPC... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-24 11:08:55 -06:00
Dave Chinner	10616b806d	xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end When _xfs_buf_find is passed an out of range address, it will fail to find a relevant struct xfs_perag and oops with a null dereference. This can happen when trying to walk a filesystem with a metadata inode that has a partially corrupted extent map (i.e. the block number returned is corrupt, but is otherwise intact) and we try to read from the corrupted block address. In this case, just fail the lookup. If it is readahead being issued, it will simply not be done, but if it is real read that fails we will get an error being reported. Ideally this case should result in an EFSCORRUPTED error being reported, but we cannot return an error through xfs_buf_read() or xfs_buf_get() so this lookup failure may result in ENOMEM or EIO errors being reported instead. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-24 11:06:41 -06:00
Miklos Szeredi	fb05f41f5f	fuse: cleanup fuse_direct_io() Fix the following sparse warnings: fs/fuse/file.c:1216:43: warning: cast removes address space of expression fs/fuse/file.c:1216:43: warning: incorrect type in initializer (different address spaces) fs/fuse/file.c:1216:43: expected void [noderef] <asn:1>iov_base fs/fuse/file.c:1216:43: got void <noident> fs/fuse/file.c:1241:43: warning: cast removes address space of expression fs/fuse/file.c:1241:43: warning: incorrect type in initializer (different address spaces) fs/fuse/file.c:1241:43: expected void [noderef] <asn:1>iov_base fs/fuse/file.c:1241:43: got void <noident> fs/fuse/file.c:1267:43: warning: cast removes address space of expression fs/fuse/file.c:1267:43: warning: incorrect type in initializer (different address spaces) fs/fuse/file.c:1267:43: expected void [noderef] <asn:1>iov_base fs/fuse/file.c:1267:43: got void <noident> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-24 16:21:28 +01:00
Maxim Patlasov	5565a9d884	fuse: optimize __fuse_direct_io() __fuse_direct_io() allocates fuse-requests by calling fuse_get_req(fc, n). The patch calculates 'n' based on iov[] array. This is useful because allocating FUSE_MAX_PAGES_PER_REQ page pointers and descriptors for each fuse request would be waste of memory in case of iov-s of smaller size. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-24 16:21:28 +01:00
Maxim Patlasov	7c190c8b9c	fuse: optimize fuse_get_user_pages() Let fuse_get_user_pages() pack as many iov-s to a single fuse_req as possible. This is very beneficial in case of iov[] consisting of many iov-s of relatively small sizes (e.g. PAGE_SIZE). Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-24 16:21:27 +01:00
Maxim Patlasov	b98d023a24	fuse: pass iov[] to fuse_get_user_pages() The patch makes preliminary work for the next patch optimizing scatter-gather direct IO. The idea is to allow fuse_get_user_pages() to pack as many iov-s to each fuse request as possible. So, here we only rework all related call-paths to carry iov[] from fuse_direct_IO() to fuse_get_user_pages(). Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-24 16:21:27 +01:00
Maxim Patlasov	85f40aec88	fuse: use req->page_descs[] for argpages cases Previously, anyone who set flag 'argpages' only filled req->pages[] and set per-request page_offset. This patch re-works all cases where argpages=1 to fill req->page_descs[] properly. Having req->page_descs[] filled properly allows to re-work fuse_copy_pages() to copy page fragments described by req->page_descs[]. This will be useful for next patches optimizing direct_IO. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-24 16:21:27 +01:00
Maxim Patlasov	b2430d7567	fuse: add per-page descriptor <offset, length> to fuse_req The ability to save page pointers along with lengths and offsets in fuse_req will be useful to cover several iovec-s with a single fuse_req. Per-request page_offset is removed because anybody who need it can use req->page_descs[0].offset instead. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-24 16:21:27 +01:00
Maxim Patlasov	54b966702d	fuse: rework fuse_do_ioctl() fuse_do_ioctl() already calculates the number of pages it's going to use. It is stored in 'num_pages' variable. So the patch simply uses it for allocating fuse_req. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-24 16:21:26 +01:00
Maxim Patlasov	d07f09f509	fuse: rework fuse_perform_write() The patch allocates as many page pointers in fuse_req as needed to cover interval [pos .. pos+len-1]. Inline helper fuse_wr_pages() is introduced to hide this cumbersome arithmetic. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-24 16:21:26 +01:00
Maxim Patlasov	f8dbdf8182	fuse: rework fuse_readpages() The patch uses 'nr_pages' argument of fuse_readpages() as heuristics for the number of page pointers to allocate. This can be improved further by taking in consideration fc->max_read and gaps between page indices, but it's not clear whether it's worthy or not. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-24 16:21:26 +01:00
Maxim Patlasov	4d53dc99ba	fuse: rework fuse_retrieve() The patch reworks fuse_retrieve() to allocate only so many page pointers as needed. The core part of the patch is the following calculation: num_pages = (num + offset + PAGE_SIZE - 1) >> PAGE_SHIFT; (thanks Miklos for formula). All other changes are mostly shuffling lines. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-24 16:21:26 +01:00
Maxim Patlasov	b111c8c0e3	fuse: categorize fuse_get_req() The patch categorizes all fuse_get_req() invocations into two categories: - fuse_get_req_nopages(fc) - when caller doesn't care about req->pages - fuse_get_req(fc, n) - when caller need n page pointers (n > 0) Adding fuse_get_req_nopages() helps to avoid numerous fuse_get_req(fc, 0) scattered over code. Now it's clear from the first glance when a caller need fuse_req with page pointers. The patch doesn't make any logic changes. In multi-page case, it silly allocates array of FUSE_MAX_PAGES_PER_REQ page pointers. This will be amended by future patches. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-24 16:21:25 +01:00
Maxim Patlasov	4250c0668e	fuse: general infrastructure for pages[] of variable size The patch removes inline array of FUSE_MAX_PAGES_PER_REQ page pointers from fuse_req. Instead of that, req->pages may now point either to small inline array or to an array allocated dynamically. This essentially means that all callers of fuse_request_alloc[_nofs] should pass the number of pages needed explicitly. The patch doesn't make any logic changes. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-24 16:21:25 +01:00
Anand V. Avati	0b05b18381	fuse: implement NFS-like readdirplus support This patch implements readdirplus support in FUSE, similar to NFS. The payload returned in the readdirplus call contains 'fuse_entry_out' structure thereby providing all the necessary inputs for 'faking' a lookup() operation on the spot. If the dentry and inode already existed (e.g. in a re-run of ls -l) then just the inode attributes timeout and dentry timeout are refreshed. With a simple client->network->server implementation of a FUSE based filesystem, the following performance observations were made: Test: Performing a filesystem crawl over 20,000 files with sh# time ls -lR /mnt Without readdirplus: Run 1: 18.1s Run 2: 16.0s Run 3: 16.2s With readdirplus: Run 1: 4.1s Run 2: 3.8s Run 3: 3.8s The performance improvement is significant as it avoided 20,000 upcalls (lookup). Cache consistency is no worse than what already is. Signed-off-by: Anand V. Avati <avati@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-24 16:21:25 +01:00
J. Bruce Fields	ff89be87c7	nfsd4: require version 4 when enabling or disabling minorversion The current code will allow silly things like: echo "+2 +3 +4 +7.1">/proc/fs/nfsd/versions Reported-by: Fan Chaoting <fanchaoting@cn.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-01-23 18:25:01 -05:00
Stanislav Kinsbursky	bca0ec6511	nfsd: fix unused "nn" variable warning in free_client() If CONFIG_LOCKDEP is disabled, then there would be a warning like this: CC [M] fs/nfsd/nfs4state.o fs/nfsd/nfs4state.c: In function ‘free_client’: fs/nfsd/nfs4state.c:1051:19: warning: unused variable ‘nn’ [-Wunused-variable] So, let's add "maybe_unused" tag to this variable. Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-01-23 18:17:40 -05:00
Yanchuan Nian	266533c6df	nfsd: Don't unlock the state while it's not locked In the procedure of CREATE_SESSION, the state is locked after alloc_conn_from_crses(). If the allocation fails, the function goes to "out_free_session", and then "out" where there is an unlock function. Signed-off-by: Yanchuan Nian <ycnian@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-01-23 18:17:37 -05:00
Yanchuan Nian	74b70dded3	nfsd: Pass correct slot number to nfsd4_put_drc_mem() In alloc_session(), numslots is the correct slot number used by the session. But the slot number passed to nfsd4_put_drc_mem() is the one from nfs client. Signed-off-by: Yanchuan Nian <ycnian@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-01-23 18:17:36 -05:00
J. Bruce Fields	84822d0b3b	nfsd4: simplify nfsd4_encode_fattr interface slightly It seems slightly simpler to make nfsd4_encode_fattr rather than its callers responsible for advancing the write pointer on success. (Also: the count == 0 check in the verify case looks superfluous. Running out of buffer space is really the only reason fattr encoding should fail with eresource.) Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-01-23 18:17:35 -05:00
Cong Ding	10b8c7dff5	fs/cifs/cifs_dfs_ref.c: fix potential memory leakage When it goes to error through line 144, the memory allocated to devname is not freed, and the caller doesn't free it either in line 250. So we free the memory of devname in function cifs_compose_mount_options() when it goes to error. Signed-off-by: Cong Ding <dinggnu@gmail.com> CC: stable <stable@kernel.org> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2013-01-22 23:58:16 -06:00
Linus Torvalds	262060ea46	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "This contain a bugfix for CUSE and miscellaneous small fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: remove unused variable in fuse_try_move_page() fuse: make fuse_file_fallocate() static fuse: Move CUSE Kconfig entry from fs/Kconfig into fs/fuse/Kconfig cuse: fix uninitialized variable warnings cuse: do not register multiple devices with identical names cuse: use mutex as registration lock instead of spinlocks	2013-01-22 11:53:19 -08:00
Linus Torvalds	05c2cf35c3	f2fs fixes for v3.8-rc5 o Support swap file and link generic_file_remap_pages o Enhance the bio streaming flow and free section control o Major bug fix on recovery routine o Minor bug/warning fixes and code cleanups -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJQ/fpIAAoJEEAUqH6CSFDS6BAP/jz/JIPoSu4NunWgVH68tzA4 xx4lsmJQJQG+1241uA9v4MMkcTzxE0QVNTOX3BpUXBhCMc00aPOPWlWjULridLI9 0vRE6LpG+NntkyKF+M5T1ydGuQoDlEvqoGs6c5p3yaI6PxzbzmBmipsmqXUA8fyu 280+OWJoAcALEMJiQ8JsHcDmvM9wdQ+BV/j3BNCm4dqBUA4dYPfDzRKUJYfwqiig qmVRseJJaekxrQ2lHG/K/WPAXa8aRcV6khP9tv/BPGRMt+I/fli/J4sWEFT6c73B +qYmhrf/RLNJ1O13dePlo3URwMu083PL8QN355GAKUJbMaX/UPjEnq0DLbBOkCS2 KBQI5O1eiFauEE6YU7p7GuvnLeVkukcXSQNVnRrnWzTUA9CMThZZ2mAgb2lz+iEP oZWNirRwnxdcTQRXjPyTCtpPTCgJi4GO1WS5s6HeLP8G8Muo4PzDzcEwMX0Mw5ih s4n4wpQ+Zp3h53cc/DxdFAK15uM3XYtUfb92kJwqaEG5VmBy6KvliXDRnNXg7WMI imCb08c0Fr0M8ZHOd+UCveICcndFj25jkjx8w7PoE1KBbGJkKf+cjpZ3OhOAliux sGtH3EZLV6jx1MjV79OpDXQGoVsvktesCbRcaxc7BbqljXYVNiap8QbzjPnmAt6z KKN0GU32eolGgK4Zd/iF =vJQc -----END PGP SIGNATURE----- Merge tag 'f2fs-for-3.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs fixes from Jaegeuk Kim: o Support swap file and link generic_file_remap_pages o Enhance the bio streaming flow and free section control o Major bug fix on recovery routine o Minor bug/warning fixes and code cleanups * tag 'f2fs-for-3.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (22 commits) f2fs: use _safe() version of list_for_each f2fs: add comments of start_bidx_of_node f2fs: avoid issuing small bios due to several dirty node pages f2fs: support swapfile f2fs: add remap_pages as generic_file_remap_pages f2fs: add __init to functions in init_f2fs_fs f2fs: fix the debugfs entry creation path f2fs: add global mutex_lock to protect f2fs_stat_list f2fs: remove the blk_plug usage in f2fs_write_data_pages f2fs: avoid redundant time update for parent directory in f2fs_delete_entry f2fs: remove redundant call to set_blocksize in f2fs_fill_super f2fs: move f2fs_balance_fs to punch_hole f2fs: add f2fs_balance_fs in several interfaces f2fs: revisit the f2fs_gc flow f2fs: check return value during recovery f2fs: avoid null dereference in f2fs_acl_from_disk f2fs: initialize newly allocated dnode structure f2fs: update f2fs partition info about SIT/NAT layout f2fs: update f2fs document to reflect SIT/NAT layout correctly f2fs: remove unneeded INIT_LIST_HEAD at few places ...	2013-01-22 10:33:17 -08:00
Namjae Jeon	99600051b0	udf: add extent cache support in case of file reading This patch implements extent caching in case of file reading. While reading a file, currently, UDF reads metadata serially which takes a lot of time depending on the number of extents present in the file. Caching last accessed extent improves metadata read time. Instead of reading file metadata from start, now we read from the cached extent. This patch considerably improves the time spent by CPU in kernel mode. For example, while reading a 10.9 GB file using dd: Time before applying patch: 11677022208 bytes (10.9GB) copied, 1529.748921 seconds, 7.3MB/s real 25m 29.85s user 0m 12.41s sys 15m 34.75s Time after applying patch: 11677022208 bytes (10.9GB) copied, 1469.338231 seconds, 7.6MB/s real 24m 29.44s user 0m 15.73s sys 3m 27.61s [JK: Fix bh refcounting issues, simplify initialization] Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: Bonggil Bak <bgbak@samsung.com> Signed-off-by: Jan Kara <jack@suse.cz>	2013-01-22 10:48:31 +01:00
Dan Carpenter	d8b79b2f94	f2fs: use _safe() version of list_for_each This is calling list_del() inside a loop which is a problem when we try move to the next item on the list. I've converted it to use the _safe version. And also, as a cleanup, I've converted it to use list_for_each_entry instead of list_for_each. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-01-22 10:49:00 +09:00
Jaegeuk Kim	9af45ef5ab	f2fs: add comments of start_bidx_of_node The caller of start_bidx_of_node() should give proper node offsets which point only direct node blocks. Otherwise, it is a caller's bug. This patch adds comments to make it clear. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-01-22 10:48:59 +09:00
Jaegeuk Kim	a7fdffbd3e	f2fs: avoid issuing small bios due to several dirty node pages If some small bios of dirty node pages are supposed to be issued during the sequential data writes, there-in well-produced consecutive data bios are able to be split by the small node bios, resulting in performance degradation. So, let's collect a number of dirty node pages until reaching a threshold. And, by default, I set the threshold as 2MB, a segment size. This improves sequential write performance on i5, 512GB SSD (830 w/ SATA2) as follows. Before: 231 MB/s -> After: 255 MB/s Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com> Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>	2013-01-22 10:48:59 +09:00
Jaegeuk Kim	c01e54b770	f2fs: support swapfile This patch adds f2fs_bmap operation to the data address space. This enables f2fs to support swapfile. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-01-22 10:48:58 +09:00
Jaegeuk Kim	692bb55d1a	f2fs: add remap_pages as generic_file_remap_pages This was added for all the file systems before. See the following commit. commit id: `0b173bc4da` [PATCH] mm: kill vma flag VM_CAN_NONLINEAR This patch moves actual ptes filling for non-linear file mappings into special vma operation: ->remap_pages(). File system must implement this method to get non-linear mappings support, if it uses filemap_fault() then generic_file_remap_pages() can be used. Now device drivers can implement this method and obtain nonlinear vma support." Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-01-22 10:48:58 +09:00
Namjae Jeon	6e6093a8f1	f2fs: add __init to functions in init_f2fs_fs Add __init to functions in init_f2fs_fs for code consistency. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-01-22 10:48:38 +09:00
Ilya Dryomov	a105bb88f4	Btrfs: fix a regression in balance usage filter Commit `3fed40cc` ("Btrfs: cleanup duplicated division functions"), which was merged into 3.8-rc1, has introduced a regression by removing logic that was guarding us against bad user input. Bring it back. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-01-21 20:40:27 -05:00
Chris Mason	83bfccb5c0	Merge branch 'mutex-ops@next-for-chris' of git://github.com/idryomov/btrfs-unstable into linus	2013-01-21 20:39:06 -05:00
Chris Mason	daf2c08911	Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into linus	2013-01-21 20:26:55 -05:00
Arne Jansen	2cf6870396	Btrfs: prevent qgroup destroy when there are still relations Currently you can just destroy a qgroup even though it is in use by other qgroups or has qgroups assigned to it. This patch prevents destruction of qgroups unless they are completely unused. Otherwise destroy will return EBUSY. Reported-by: Eric Hopper <hopper@omnifarious.org> Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-01-21 20:18:11 -05:00
Arne Jansen	ff24858c65	Btrfs: ignore orphan qgroup relations If a qgroup that still has assignments is deleted by the user, the corresponding relations are left in the tree. This leads to an unmountable filesystem. With this patch, those relations are simply ignored. Reported-by: Eric Hopper <hopper@omnifarious.org> Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-01-21 20:18:11 -05:00
Kees Cook	b7e17a10ed	fs/ufs: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. CC: Evgeniy Dushistov <dushistov@mail.ru> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-21 14:39:06 -08:00
Kees Cook	f987c90257	fs/nfsd: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. CC: "J. Bruce Fields" <bfields@fieldses.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-21 14:39:05 -08:00
Kees Cook	03edef0fea	fs/logfs: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. CC: Joern Engel <joern@logfs.org> CC: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-21 14:39:05 -08:00
Kees Cook	cf98c5e568	fs/jffs2: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. CC: David Woodhouse <dwmw2@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-21 14:39:05 -08:00
Kees Cook	adae07485e	fs/hfs: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-21 14:39:05 -08:00
Kees Cook	462f16a557	fs/efs: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. Cc: Neil Brown <neilb@suse.de> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-21 14:39:05 -08:00
Kees Cook	00f3616b25	fs/cifs: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. CC: Steve French <sfrench@samba.org> CC: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-21 14:39:05 -08:00
Kees Cook	38db331b57	fs/btrfs: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. Cc: Chris Mason <chris.mason@fusionio.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-21 14:39:05 -08:00
Kees Cook	d0e09c80f0	fs/bfs: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. Cc: "Tigran A. Aivazian" <tigran@aivazian.fsnet.co.uk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-21 14:39:04 -08:00
Kees Cook	5c8e0226e7	fs/befs: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-21 14:39:04 -08:00
Kees Cook	8dbd5d6df3	fs/afs: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. CC: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-21 14:39:04 -08:00
Kees Cook	6d7a19fa74	fs/affs: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-21 14:39:04 -08:00
Kees Cook	acd4fd07d4	fs/adfs: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. Acked-by: Stuart Swales <stuart.swales.croftnuisk@gmail.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-21 14:39:04 -08:00
Kees Cook	c904197533	fs/9p: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. CC: Eric Van Hensbergen <ericvh@gmail.com> CC: Ron Minnich <rminnich@sandia.gov> CC: Latchesar Ionkov <lucho@ionkov.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-21 14:39:04 -08:00
Jan Kara	9734c971aa	udf: Write LVID to disk after opening / closing So far we just marked the buffer as dirty and left writing on flusher thread but especially on opening that opens possible race window where we could write other modified fs structures to disk before we mark filesystem as open. So sync LVID buffer to disk after opening and closing fs. Reported-by: Steve Nickel <snickel58@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2013-01-21 11:19:58 +01:00
Wang Shilong	c04e88e271	Ext3: return ENOMEM rather than EIO if sb_getblk fails It will be better to use ENOMEM rather than EIO, because the only reason that sb_getblk fails is that allocation fails. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz>	2013-01-21 11:19:57 +01:00
Wang Shilong	ab6a773dbc	Ext2: return ENOMEM rather than EIO if sb_getblk fails The only reason that sb_getblk fails is that allocation fails, so it is better to return ENOMEM rather than EIO. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz>	2013-01-21 11:19:57 +01:00
Wang Shilong	1b7d76e9b1	Ext3: use unlikely to improve the efficiency of the kernel Because the function 'sb_getblk' seldom fails and returns NULL, it is better to use unlikely to check it. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz>	2013-01-21 11:19:57 +01:00
Wang Shilong	2b0542a4a0	Ext2: use unlikely to improve the efficiency of the kernel Because the function 'sb_getblk' seldom fails and returns NULL, it is better to use unlikely to optimize the check. Signed-off-by: Wang shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz>	2013-01-21 11:19:56 +01:00
Wang Shilong	61f43e6880	Ext3: add necessary check in case IO error happens As we know, an IO error may happen when the function 'sb_getblk' is called. Add the necessary check for it. The patch also fixes a coding style problem. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz>	2013-01-21 11:19:56 +01:00
Wang Shilong	8d8759eb48	Ext2: free memory allocated and forget buffer head when io error happens Add a necessary check when an io error happens. If an io error happens, free the allocated memory and forget the buffer head. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz>	2013-01-21 11:19:56 +01:00
Jan Kara	f56426ae4d	ext3: Fix memory leak when quota options are specified multiple times When usrjquota or grpjquota mount options are specified several times, we leak memory storing the names. Free the memory correctly. Reported-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Jan Kara <jack@suse.cz>	2013-01-21 11:19:55 +01:00
Guo Chao	306a74920b	ext3, ext4, ocfs2: remove unused macro NAMEI_RA_INDEX This macro, initially introduced by ext2 in v0.99.15, does not have any users from the beginning. It has been removed in later ext2 version but still remains in the code of ext3, ext4, ocfs2. Remove this macro there. Cc: Jan Kara <jack@suse.cz> Cc: linux-ext4@vger.kernel.org Cc: ocfs2-devel@oss.oracle.com Acked-by: Mark Fasheh <mfasheh@suse.de> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz>	2013-01-21 11:19:55 +01:00
Nickolai Zeldovich	e3e2775ced	cifs: fix srcip_matches() for ipv6 srcip_matches() previously had code like this: srcip_matches(..., struct sockaddr *rhs) { /* ... */ struct sockaddr_in6 *vaddr6 = (struct sockaddr_in6 *) &rhs; return ipv6_addr_equal(..., &vaddr6->sin6_addr); } which interpreted the values on the stack after the 'rhs' pointer as an ipv6 address. The correct thing to do is to use 'rhs', not '&rhs'. Signed-off-by: Nickolai Zeldovich <nickolai@csail.mit.edu> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2013-01-21 01:37:26 -06:00
Ilya Dryomov	25122d15e2	Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag Operation-specific check (whether subvol is readonly or not) should go after the mutual exclusiveness check. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2013-01-20 16:21:22 +02:00
Ilya Dryomov	4ac20c70b0	Btrfs: fix unlock order in btrfs_ioctl_rm_dev Fix unlock order in btrfs_ioctl_rm_dev(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2013-01-20 16:21:21 +02:00
Ilya Dryomov	18f39c416d	Btrfs: fix unlock order in btrfs_ioctl_resize Fix unlock order in btrfs_ioctl_resize(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2013-01-20 16:21:20 +02:00
Ilya Dryomov	2c0c9da02a	Btrfs: fix "mutually exclusive op is running" error code The error code that is returned in response to starting a mutually exclusive operation when there is one already running got silently changed from EINVAL to EINPROGRESS by `5ac00add`. Returning EINPROGRESS to, say, add_dev, when rm_dev is running is misleading. Furthermore, the operation itself may want to use EINPROGRESS for other purposes. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2013-01-20 16:21:18 +02:00
Ilya Dryomov	ed0fb78fb6	Btrfs: bring back balance pause/resume logic Balance pause/resume logic got broken by `5ac00add` (went in into 3.8-rc1 as part of dev-replace merge). Offending commit took a stab at making mutually exclusive volume operations (add_dev, rm_dev, resize, balance, replace_dev) not block behind volume_mutex if another such operation is in progress and instead return an error right away. Balancing front-end relied on the blocking behaviour, so the fix is ugly, but short of a complete rework, it's the best we can do. Reported-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2013-01-20 16:21:17 +02:00
Joe Millenbach	4f73bc4dd3	tty: Added a CONFIG_TTY option to allow removal of TTY The option allows you to remove TTY and compile without errors. This saves space on systems that won't support TTY interfaces anyway. bloat-o-meter output is below. The bulk of this patch consists of Kconfig changes adding "depends on TTY" to various serial devices and similar drivers that require the TTY layer. Ideally, these dependencies would occur on a common intermediate symbol such as SERIO, but most drivers "select SERIO" rather than "depends on SERIO", and "select" does not respect dependencies. bloat-o-meter output comparing our previous minimal to new minimal by removing TTY. The list is filtered to not show removed entries with awk '$3 != "-"' as the list was very long. add/remove: 0/226 grow/shrink: 2/14 up/down: 6/-35356 (-35350) function old new delta chr_dev_init 166 170 +4 allow_signal 80 82 +2 static.__warned 143 142 -1 disallow_signal 63 62 -1 __set_special_pids 95 94 -1 unregister_console 126 121 -5 start_kernel 546 541 -5 register_console 593 588 -5 copy_from_user 45 40 -5 sys_setsid 128 120 -8 sys_vhangup 32 19 -13 do_exit 1543 1526 -17 bitmap_zero 60 40 -20 arch_local_irq_save 137 117 -20 release_task 674 652 -22 static.spin_unlock_irqrestore 308 260 -48 Signed-off-by: Joe Millenbach <jmillenbach@gmail.com> Reviewed-by: Jamey Sharp <jamey@minilop.net> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-18 16:15:27 -08:00
Ben Myers	003fd6c8be	xfs: fix fs/xfs/xfs_log.c:1740:39: error: 'B_TRUE' undeclared Commit `667a9291c5` "xfs: Remove boolean_t typedef completely." didn't. Remove a stray B_TRUE that breaks CONFIG_XFS_DEBUG=y. Signed-off-by: Ben Myers <bpm@sgi.com> Reported-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com>	2013-01-18 15:11:57 -06:00
Greg Kroah-Hartman	ed408f7c0f	Merge 3.8-rc4 into driver-core-next This is to fix up a build problem with a wireless driver due to the dynamic-debug patches in this branch. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-17 19:48:18 -08:00
Brian Foster	9e96fe6df4	xfs: pull up stack_switch check into xfs_bmapi_write The stack_switch check currently occurs in __xfs_bmapi_allocate, which means the stack switch only occurs when xfs_bmapi_allocate() is called in a loop. Pull the check up before the loop in xfs_bmapi_write() such that the first iteration of the loop has consistent behavior. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-17 17:53:37 -06:00
Thiago Farina	667a9291c5	xfs: Remove boolean_t typedef completely. Since we are using C99 we have one builtin defined in include/linux/types.h, use that instead. v2: you missed one in fs/xfs/xfs_qm_bhv.c, cleaned up. -bpm Signed-off-by: Thiago Farina <tfarina@chromium.org> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-17 17:32:57 -06:00
Alex Elder	e8afad656c	libceph: pass length to ceph_calc_file_object_mapping() ceph_calc_file_object_mapping() takes (among other things) a "file" offset and length, and based on the layout, determines the object number ("bno") backing the affected portion of the file's data and the offset into that object where the desired range begins. It also computes the size that should be used for the request--either the amount requested or something less if that would exceed the end of the object. This patch changes the input length parameter in this function so it is used only for input. That is, the argument will be passed by value rather than by address, so the value provided won't get updated by the function. The value would only get updated if the length would surpass the current object, and in that case the value it got updated to would be exactly that returned in *oxlen. Only one of the two callers is affected by this change. Update ceph_calc_raw_layout() so it records any updated value. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 15:52:04 -06:00
Greg Kroah-Hartman	595e0eb067	Revert "sysfs: Convert print_symbol to %pSR" This reverts commit `6ad58fa82d` as %pSR isn't in the tree yet. Cc: Joe Perches <joe@perches.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-17 13:09:57 -08:00
Sasha Levin	1884bd4b14	debugfs: remove redundant initialization of dentry We already initialize it to NULL when declaring it, no need to do that twice. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-17 13:02:08 -08:00
Joe Perches	6ad58fa82d	sysfs: Convert print_symbol to %pSR Use the new vsprintf extension to avoid any possible message interleaving. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-17 12:57:07 -08:00
Bin Wang	6b8fbde418	sysfs: Fixed a trailing white space error This patch removes the trailing white space in fs/sysfs/mount.c. Signed-off-by: Bin Wang <wbin00@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-17 12:54:47 -08:00
Yan, Zheng	390306c38d	ceph: check mds_wanted for imported cap The MDS may have incorrect wanted caps after importing caps. So the client should check the value mds has and send cap update if necessary. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-01-17 12:42:38 -06:00
Yan, Zheng	66f58691c5	ceph: allocate cap_release message when receiving cap import When client wants to release an imported cap, it's possible there is no reserved cap_release message in corresponding mds session. so __queue_cap_release causes kernel panic. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-01-17 12:42:38 -06:00
Yan, Zheng	395c312b9c	ceph: allow revoking duplicated caps issued by non-auth MDS Allow revoking duplicated caps issued by non-auth MDS if these caps are also issued by auth MDS. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-01-17 12:42:38 -06:00
Yan, Zheng	8a92a119b2	ceph: move dirty inode to migrating list when clearing auth caps Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-01-17 12:42:37 -06:00
Sam Lang	6e8575faa8	ceph: Check for created flag in response from mds The mds now sends back a created inode if the create request performed the create. If the file already existed, no inode is returned in the reply. This allows ceph to set the created flag in atomic_open so that permissions are properly checked in the case that the file wasn't created by the create call to the mds. To ensure compatibility with previous kernels, a feature for sending back the inode in the create reply was added, so that the mds will only send back the inode if the client indicates it supports the feature. Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-01-17 12:42:36 -06:00
Sam Lang	79aec9844d	ceph: Check for err on mds request in atomic_open The error returned by ceph_mdsc_do_request includes errors sending the request, errors on timeout, or any errors coming from the mds. If ceph_mdsc_do_request returns an error, the reply struct will most likely be bogus. We need to bail out and propagate the error instead of overwriting it. Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-01-17 12:42:36 -06:00
Wei Yongjun	8f706111a8	fuse: remove unused variable in fuse_try_move_page() The variables mapping,index are initialized but never used otherwise, so remove the unused variables. dpatch engine is used to auto generate this patch. (https://github.com/weiyj/dpatch) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-17 13:09:59 +01:00
Miklos Szeredi	cdadb11cef	fuse: make fuse_file_fallocate() static Fix the following sparse warning: fs/fuse/file.c:2249:6: warning: symbol 'fuse_file_fallocate' was not declared. Should it be static? Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-17 13:09:47 +01:00
Robert P. J. Day	807185eb3e	fuse: Move CUSE Kconfig entry from fs/Kconfig into fs/fuse/Kconfig Given that CUSE depends on FUSE, it only makes sense to move its Kconfig entry into the FUSE Kconfig file. Also, add a few grammatical and semantic touchups. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-17 13:08:45 +01:00
Miklos Szeredi	e2560362cc	cuse: fix uninitialized variable warnings Fix the following compiler warnings: fs/fuse/cuse.c: In function 'cuse_process_init_reply': fs/fuse/cuse.c:288:24: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized] fs/fuse/cuse.c:272:14: note: 'val' was declared here fs/fuse/cuse.c:284:10: warning: 'key' may be used uninitialized in this function [-Wmaybe-uninitialized] fs/fuse/cuse.c:272:8: note: 'key' was declared here Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-17 13:05:52 +01:00
David Herrmann	30783587b0	cuse: do not register multiple devices with identical names Sysfs doesn't allow two devices with the same name, but we register a sysfs entry for each cuse device without checking for name collisions. This extends the registration to first check whether the name was already registered. To avoid race-conditions between the name-check and linking the device, we need to protect the whole registration with a mutex. Signed-off-by: David Herrmann <dh.herrmann@googlemail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-17 13:04:57 +01:00
David Herrmann	8ce03fd76d	cuse: use mutex as registration lock instead of spinlocks We need to check for name-collisions during cuse-device registration. To avoid race-conditions, this needs to be protected during the whole device registration. Therefore, replace the spinlocks by mutexes first so we can safely extend the locked regions to include more expensive or sleeping code paths. Signed-off-by: David Herrmann <dh.herrmann@googlemail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-01-17 13:04:51 +01:00
Philippe De Muyter	4de80273a2	fs/jfs: Fix typo in comment : 'how may' -> 'how many' Signed-off-by: Philippe De Muyter <phdm@macqel.be> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2013-01-17 10:18:02 +01:00
Zheng Liu	aaddea812c	ext4: add tracepoint in punching hole This patch adds a tracepoint in ext4_punch_hole. CC: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-01-16 20:21:26 -05:00
Linus Torvalds	dfdebc2483	xfs: bugfixes for 3.8-rc4 - fix(es) for compound buffers - fix for dquot soft timer asserts due to overflow of d_blk_softlimit - fix for regression in dir v2 code introduced in commit `20f7e9f3` -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAABAgAGBQJQ9zKnAAoJENaLyazVq6ZORGcP/RemqCHJEw0a89Y0tLLLAcz/ Es97kJMESdvi3gX3JTdz3vC8LP21dSCR3k3MvVgucb8RsvGoiLixrmluIRxKb79M DEmz9YJ/qxFIpnM9y46VxCYV+/ezxUDEv68wA6T2wJbof26nTLlTj2gAgqjvyWiF R1c1OmdCsTfA257UvxfxSVixVnWv7E2io2ZXUGsrBkP4J9OMaMtn00UYOuP1YL8S NJ44z9QAzTqVEbAfGeaeV/QVUJzMj/IqWCwF7YKEhfmccO/tPyN0+nMG2DI0Fp5e cYGsi4JnaFbqE6Aa/7mu3kv8lYnPe0n3t9d3EwzxOEx+PAvuY8N0EW8Qa4c+805n zXFvAroLgP0jYEEuIfEGYIwDPxG0xjor6ztu8e2twcIj6cDHzSpeYaDPnYvWJlwu FiupnVu+3FX6mVY1jCealI47nOwzM12R7nXysqF3F6Sf95xGJtG3BoTIKioNqk1g dzJGMQvwg/WLvquYb9W/ZNb1T314R23wdYtmI7gWJ74z9IQqWCZBWFYyBhQ8y1Pr Vf3LFjzqNqqnYNzoe8Wnn9wKQ57Es7onAo34Y9HZCOkslZsn5nKriNTXNN6Q9Upc 5RKvj1CbTpKAJYrrhWryI1HtlDKqqtMFdmRQulSu+O9ZJuWZh4XNTu4t3oewt0Ac 5otZwOdk53V3tGxt3prs =gA4q -----END PGP SIGNATURE----- Merge tag 'for-linus-v3.8-rc4' of git://oss.sgi.com/xfs/xfs Pull xfs bugfixes from Ben Myers: - fix(es) for compound buffers - fix for dquot soft timer asserts due to overflow of d_blk_softlimit - fix for regression in dir v2 code introduced in commit `20f7e9f372` ("xfs: factor dir2 block read operations") * tag 'for-linus-v3.8-rc4' of git://oss.sgi.com/xfs/xfs: xfs: recalculate leaf entry pointer after compacting a dir2 block xfs: remove int casts from debug dquot soft limit timer asserts xfs: fix the multi-segment log buffer format xfs: fix segment in xfs_buf_item_format_segment xfs: rename bli_format to avoid confusion with bli_formats xfs: use b_maps[] for discontiguous buffers	2013-01-16 16:19:54 -08:00
Eric Sandeen	aeb4f20a02	xfs: Do not return EFSCORRUPTED when filesystem probe finds no XFS magic `9802182` changed the return value from EWRONGFS (aka EINVAL) to EFSCORRUPTED which doesn't seem to be handled properly by the root filesystem probe. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Tested-by: Sergei Trofimovich <slyfox@gentoo.org> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-16 17:33:53 -06:00
Eric Sandeen	37f13561de	xfs: recalculate leaf entry pointer after compacting a dir2 block Dave Jones hit this assert when doing a compile on recent git, with CONFIG_XFS_DEBUG enabled: XFS: Assertion failed: (char *)dup - (char *)hdr == be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup)), file: fs/xfs/xfs_dir2_data.c, line: 828 Upon further digging, the tag found by xfs_dir2_data_unused_tag_p(dup) contained "2" and not the proper offset, and I found that this value was changed after the memmoves under "Use a stale leaf for our new entry." in xfs_dir2_block_addname(), i.e. memmove(&blp[mid + 1], &blp[mid], (highstale - mid) * sizeof(*blp)); overwrote it. What has happened is that the previous call to xfs_dir2_block_compact() has rearranged things; it changes btp->count as well as the blp array. So after we make that call, we must recalculate the proper pointer to the leaf entries by making another call to xfs_dir2_block_leaf_p(). Dave provided a metadump image which led to a simple reproducer (create a particular filename in the affected directory) and this resolves the testcase as well as the bug on his live system. Thanks also to dchinner for looking at this one with me. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Tested-by: Dave Jones <davej@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-16 16:08:55 -06:00
Brian Foster	ab7eac2200	xfs: remove int casts from debug dquot soft limit timer asserts The int casts here make it easy to trigger an assert with a large soft limit. For example, set a >4TB soft limit on an empty volume to reproduce a (0 > -x) comparison due to an overflow of d_blk_softlimit. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-16 16:08:40 -06:00
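The overflow described in the commit above can be illustrated with a small userspace sketch. This is not the XFS code: the function names are hypothetical, and the narrowed result relies on the usual two's-complement behaviour, which the C standard leaves implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Full-width comparison: correct for any 64-bit block count. */
static int within_limit_wide(uint64_t used, uint64_t soft)
{
	return used <= soft;
}

/*
 * Narrowed comparison, roughly what a debug check with int casts does.
 * Converting an out-of-range value to int is implementation-defined;
 * on common two's-complement targets a large limit becomes negative,
 * which yields the "(0 > -x)" situation from the commit message.
 */
static int within_limit_narrow(uint64_t used, uint64_t soft)
{
	return (int)used <= (int)soft;
}

/* Hypothetical debug check that uses the full-width comparison. */
static void check_soft_limit(uint64_t used, uint64_t soft)
{
	assert(within_limit_wide(used, soft));
}
```

With `used == 0` and a soft limit of `0x80000000`, the wide check passes, while the narrowed one is likely to fail on typical compilers.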
Mark Tinguely	91e4bac0b7	xfs: fix the multi-segment log buffer format Per Dave Chinner's suggestion, this patch: 1) Corrects the detection of whether a multi-segment buffer is still tracking data. 2) Clears all the buffer log formats for a multi-segment buffer. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-16 16:08:08 -06:00
Mark Tinguely	2d0e9df579	xfs: fix segment in xfs_buf_item_format_segment Not every segment in a multi-segment buffer is dirty in a transaction, and clean segments will not be output. The assert in xfs_buf_item_format_segment() that checks that at least one chunk of data in the segment is used is not necessarily true for multi-segment buffers. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-16 16:07:56 -06:00
Mark Tinguely	0f22f9d0cd	xfs: rename bli_format to avoid confusion with bli_formats Rename the bli_format structure to __bli_format to avoid accidentally confusing it with the bli_formats pointer. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-16 16:07:37 -06:00
Mark Tinguely	d44d9bc68e	xfs: use b_maps[] for discontiguous buffers Commits starting at `77c1a08` introduced multiple-segment support to xfs_buf. xfs_trans_buf_item_match() could not find a multi-segment buffer in the transaction because it was looking at the single segment block number rather than the multi-segment b_maps[0].bm.bn. This results in a recursive buffer lock that can never be satisfied. This patch: 1) Changes the remaining b_map accesses to be b_maps[0] accesses. 2) Renames the single segment b_map structure to __b_map to avoid future confusion. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-16 16:07:11 -06:00
Linus Torvalds	31db720643	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext3 and udf fixes from Jan Kara: "One ext3 performance regression fix and one udf regression fix (oops on interrupted mount)." * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: UDF: Fix a null pointer dereference in udf_sb_free_partitions jbd: don't wake kjournald unnecessarily	2013-01-16 10:55:10 -08:00
Kees Cook	1e817fb62c	time: create __getnstimeofday for WARNless calls The pstore RAM backend can get called during resume, and must be defensive against a suspended time source. Expose getnstimeofday logic that returns an error instead of a WARN. This can be detected and the timestamp can be zeroed out. Reported-by: Doug Anderson <dianders@chromium.org> Cc: John Stultz <johnstul@us.ibm.com> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: John Stultz <john.stultz@linaro.org>	2013-01-15 18:16:02 -08:00
Akinobu Mita	3d251a5b9e	UBIFS: rename random32() to prandom_u32() Use more preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>	2013-01-15 15:45:27 +02:00
Namjae Jeon	4589d25d01	f2fs: fix the debugfs entry creation path The "status" debugfs entry is maintained for the entire F2FS filesystem irrespective of the number of partitions, so we can move the initialization to the init part of f2fs and do the destroy from the exit part. After this change, the entry creation code is not executed for individual partition mounts. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-01-15 20:19:15 +09:00
majianpeng	66af62ce75	f2fs: add global mutex_lock to protect f2fs_stat_list There is a race condition between umounting f2fs and reading f2fs/status, which results in an oops. For example, Thread A runs "umount f2fs" while Thread B runs "cat f2fs/status": (B) stat_show() starts list_for_each_entry_safe(&f2fs_stat_list); (A) f2fs_destroy_stats() runs list_del(&si->stat_list); mutex_lock(&si->stat_lock); si->sbi = NULL; mutex_unlock(&si->stat_lock); kfree(sbi->stat_info); (B) stat_show() then calls mutex_lock(&si->stat_lock), but si is gone. Solution with a global lock, f2fs_stat_mutex: (A) f2fs_destroy_stats() runs mutex_lock(&f2fs_stat_mutex); list_del(&si->stat_list); mutex_unlock(&f2fs_stat_mutex); kfree(sbi->stat_info); (B) stat_show() runs mutex_lock(&f2fs_stat_mutex); list_for_each_entry_safe(&f2fs_stat_list) ...; mutex_unlock(&f2fs_stat_mutex). Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> [jaegeuk.kim@samsung.com: fix typos, description, and remove the existing lock] Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-01-15 20:18:29 +09:00
Namjae Jeon	fa9150a84c	f2fs: remove the blk_plug usage in f2fs_write_data_pages Let's consider the usage of blk_plug in f2fs_write_data_pages(). We can come up with two issues: lock contention and task awareness. 1. Merging bios prior to grabbing the "queue lock": f2fs merges consecutive IOs at the file system level before submitting any bios, which is similar to the back merge by the plugging mechanism in attempt_plug_merge(). Neither needs to acquire the queue lock. 2. Merging policy with respect to tasks: f2fs merges IOs as much as possible regardless of tasks, while blk-plugging is conducted on a per-task basis. There are trade-offs here, and f2fs tries to maximize write performance with well-merged bios. As a result, if f2fs produced many consecutive but separate bios in writepages(), it would be good to use blk-plugging, since f2fs would be able to avoid queue lock contention in the block layer by merging them. But f2fs merges IOs and submits one bio, which means there is not much chance to merge bios in attempt_plug_merge(). However, f2fs already uses blk_plug by triggering generic_writepages() in f2fs_write_data_pages(). So, for overall code consistency, I'd like to remove blk_plug there. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-01-15 20:18:16 +09:00
Namjae Jeon	1b1baff6e5	UDF: Fix a null pointer dereference in udf_sb_free_partitions This patch fixes a regression caused by commit `bff943af6f` "udf: Fix memory leak when mounting", which triggered a kernel null pointer dereference on an interrupted mount or when allocating memory for sbi->s_partmaps failed in udf_sb_alloc_partition_maps(). Reported-and-tested-by: James Hogan <james@albanarts.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: Jan Kara <jack@suse.cz>	2013-01-14 22:53:47 +01:00
Eric Sandeen	7e2fb2d7e6	jbd: don't wake kjournald unnecessarily Don't send an extra wakeup to kjournald in the case where we already have the proper target in j_commit_request, i.e. that commit has already been requested for commit. commit `d9b0193` "jbd: fix fsync() tid wraparound bug" changed the logic leading to a wakeup, but it caused some extra wakeups which were found to lead to a measurable performance regression. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>	2013-01-14 22:50:45 +01:00
Linus Torvalds	6d283dba37	vfs: add missing virtual cache flush after editing partial pages Andrew Morton pointed this out a month ago, and then I completely forgot about it. If we read a partial last page of a block device, we will zero out the end of the page, but since that page can then be mapped into user space, we should also make sure to flush the cache on architectures that have virtual caches. We have the flush_dcache_page() function for this, so use it. Now, in practice this really never matters, because nobody sane uses virtual caches to begin with, and they largely exist on old broken RISC architectures. And even if you did run on one of those obsolete CPUs, the whole "mmap and access the last partial page of a block device" behavior probably doesn't actually exist. The normal IO functions (read/write) will never see the zeroed-out part of the page that might not be coherent in the cache, because they honor the size of the device. So I'm marking this for stable (3.7 only), but I'm not sure anybody will ever care. Pointed-out-by: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org # 3.7 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-01-14 13:17:50 -08:00
Eric Sandeen	3972f2603d	btrfs: update timestamps on truncate() truncate() vs. ftruncate() differ in the VFS; truncate() doesn't set (ATTR_CTIME | ATTR_MTIME), and it's up to the fs to do the timestamp updates if the size changes. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com>	2013-01-14 13:53:37 -05:00
Zach Brown	f276795627	btrfs: fix btrfs_cont_expand() freeing IS_ERR em btrfs_cont_expand() tries to free an IS_ERR em as it gets an error from btrfs_get_extent() and breaks out of its loop. An instance of -EEXIST was reported in the wild: https://bugzilla.redhat.com/show_bug.cgi?id=874407 I have no idea if that -EEXIST is surprising, or not. Regardless, this error handling should be cleaned up to handle other reasonable errors (ENOMEM, EIO; whatever). This seemed to be the only buggy freeing of the relatively rare IS_ERR em so I opted to fix the caller rather than teach free_extent_map() to use IS_ERR_OR_NULL(). Signed-off-by: Zach Brown <zab@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com>	2013-01-14 13:53:23 -05:00
Liu Bo	f9e4fb5393	Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents xfstests case 285 complains. This is because btrfs did not try to find unwritten delalloc bytes (only dirty pages, not yet written back) behind prealloc extents, so it ends up finding nothing when we use SEEK_DATA. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:53:22 -05:00
Liu Bo	1214b53f90	Btrfs: fix off-by-one in lseek Lock end is inclusive. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:53:22 -05:00
Liu Bo	3268a2468e	Btrfs: reset path lock state to zero We forgot to reset the path lock state to zero after we unlock the path block, and this can trigger the ASSERT checker in the tree unlock API. Reported-by: Slava Barinov <rayslava@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:52:53 -05:00
Liu Bo	ac5c93005b	Btrfs: let allocation start from the right raid type This avoids empty looping. Say we have only one disk, so the metadata raid type defaults to DUP; we do not need to start from index=0 (RAID10) and go through two empty loops to reach index=2 (DUP). Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:52:52 -05:00
Josef Bacik	f3fe820c20	Btrfs: add orphan before truncating pagecache Running xfstests 83 in a loop would sometimes fail the fsck. This happens because if we invalidate a page that already has an ordered extent setup for it we will complete the ordered extent ourselves, assuming that the truncate will clean everything up. The problem with this is there is plenty of time for the truncate to fail after we've done this work. So to fix this we need to add the orphan item first to make sure the cleanup gets done properly, and then we can truncate the pagecache and all that stuff and be safe. This fixes the btrfsck failures I was seeing while running 83 in a loop. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:52:52 -05:00
Josef Bacik	72bcd99d45	Btrfs: set flushing if we're limited flushing We still need to say we're flushing if we're limit flushing to keep somebody from coming in and stealing our reservation. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:52:51 -05:00
Miao Xie	9754767657	Btrfs: fix missing write access release in btrfs_ioctl_resize() We forget to give up the write access after we find some device operation is going on. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:52:51 -05:00
Miao Xie	dba60f3f5d	Btrfs: fix resize a readonly device We should not resize a readonly device, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:52:49 -05:00
Miao Xie	5c39da5b6c	Btrfs: do not delete a subvolume which is in a R/O subvolume Step to reproduce: # mkfs.btrfs <disk> # mount <disk> <mnt> # btrfs sub create <mnt>/subv0 # btrfs sub snap <mnt> <mnt>/subv0/snap0 # change <mnt>/subv0 from R/W to R/O # btrfs sub del <mnt>/subv0/snap0 We deleted the snapshot successfully. I think we should not be able to delete the snapshot since the parent subvolume is R/O. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2013-01-14 13:52:32 -05:00
Miao Xie	d86e56cf7d	Btrfs: disable qgroup id 0 Qgroup id 0 is a special number, so we should not set the id of a qgroup to 0. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2013-01-14 13:52:31 -05:00
Lukas Czerner	cc975eb460	btrfs: get the device in write mode when deleting it When we're deleting the device we should get it in write mode since we're going to re-write the super block magic on that device. And it should fail if the device is read-only. Signed-off-by: Lukas Czerner <lczerner@redhat.com>	2013-01-14 13:52:31 -05:00
Tsutomu Itoh	cfa7a9ccda	Btrfs: fix memory leak in name_cache_insert() We should free name_cache_entry before returning from the error handling code. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>	2013-01-14 13:52:30 -05:00
Linus Torvalds	3441f0d26d	Driver core fixes for 3.8-rc3 Merge tag 'driver-core-3.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fixes from Greg Kroah-Hartman: "Here are two patches for 3.8-rc3. One removes the __dev* defines from init.h now that all usages of them are gone from your tree. The other fix is for debugfs's parameter that was using the wrong base for the option. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" * tag 'driver-core-3.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: debugfs: convert gid= argument from decimal, not octal Remove __dev* markings from init.h	2013-01-14 09:07:11 -08:00
Tejun Heo	9fb0a7da0c	writeback: add more tracepoints Add tracepoints for page dirtying, writeback_single_inode start, inode dirtying and writeback. For the latter two inode events, a pair of events are defined to denote start and end of the operations (the starting one has _start suffix and the one w/o suffix happens after the operation is complete). These inode ops are FS specific and can be non-trivial and having enclosing tracepoints is useful for external tracers. This is part of tracepoint additions to improve visibility into dirtying / writeback operations for io tracer and userland. v2: writeback_dirty_inode[_start] TPs may be called for files on pseudo FSes w/ unregistered bdi. Check whether bdi->dev is %NULL before dereferencing. v3: buffer dirtying moved to a block TP. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-01-14 15:00:36 +01:00
Tejun Heo	5305cb8308	block: add block_{touch|dirty}_buffer tracepoint The former is triggered from touch_buffer() and the latter from mark_buffer_dirty(). This is part of tracepoint additions to improve visibility into dirtying / writeback operations for io tracer and userland. v2: Transformed writeback_dirty_buffer to block_dirty_buffer and made it share TP definition with block_touch_buffer. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-01-14 15:00:36 +01:00
Tejun Heo	f0059afd3e	buffer: make touch_buffer() an exported function We want to add a trace point to touch_buffer() but macros and inline functions defined in header files can't have tracing points. Move touch_buffer() to fs/buffer.c and make it a proper function. The new exported function is also declared inline. As most uses of touch_buffer() are inside buffer.c with nilfs2 as the only other user, the effect of this change should be negligible. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-01-14 15:00:36 +01:00
Tejun Heo	3a366e614d	block: add missing block_bio_complete() tracepoint bio completion didn't kick block_bio_complete TP. Only dm was explicitly triggering the TP on IO completion. This makes block_bio_complete TP useless for tracers which want to know about bios, and all other bio based drivers skip generating blktrace completion events. This patch makes all bio completions via bio_endio() generate block_bio_complete TP. * Explicit trace_block_bio_complete() invocation removed from dm and the trace point is unexported. * @rq dropped from trace_block_bio_complete(). bios may fly around w/o queue associated. Verifying and accessing the associated queue belongs to TP probes. * blktrace now gets both request and bio completions. Make it ignore bio completions if request completion path is happening. This makes all bio based drivers generate blktrace completion events properly and makes the block_bio_complete TP actually useful. v2: With this change, block_bio_complete TP could be invoked on sg commands which have bio's with %NULL bi_bdev. Update TP assignment code to check whether bio->bi_bdev is %NULL before dereferencing. Signed-off-by: Tejun Heo <tj@kernel.org> Original-patch-by: Namhyung Kim <namhyung@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alasdair Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-01-14 15:00:36 +01:00
Namjae Jeon	163799872b	f2fs: avoid redundant time update for parent directory in f2fs_delete_entry In call to f2fs_delete_entry, 'dir' time modification code is put at two places. So, remove the redundant code for timing update. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-01-14 09:43:27 +09:00
Namjae Jeon	ff9234ad4e	f2fs: remove redundant call to set_blocksize in f2fs_fill_super Since f2fs supports only a 4KB blocksize, which is set at the beginning of f2fs_fill_super, we do not need to check this blocksize setting again. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-01-14 09:41:30 +09:00
Abhijit Pawar	a17164e54b	fs/xfs remove obsolete simple_strto<foo> This patch replaces usages of obsolete simple_strtoul with kstrtoint in xfs_args and suffix_strtoul. Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-13 14:42:07 -06:00
Eric Sandeen	d4608632ec	xfs: recalculate leaf entry pointer after compacting a dir2 block Dave Jones hit this assert when doing a compile on recent git, with CONFIG_XFS_DEBUG enabled: XFS: Assertion failed: (char *)dup - (char *)hdr == be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup)), file: fs/xfs/xfs_dir2_data.c, line: 828 Upon further digging, the tag found by xfs_dir2_data_unused_tag_p(dup) contained "2" and not the proper offset, and I found that this value was changed after the memmoves under "Use a stale leaf for our new entry." in xfs_dir2_block_addname(), i.e. memmove(&blp[mid + 1], &blp[mid], (highstale - mid) * sizeof(*blp)); overwrote it. What has happened is that the previous call to xfs_dir2_block_compact() has rearranged things; it changes btp->count as well as the blp array. So after we make that call, we must recalculate the proper pointer to the leaf entries by making another call to xfs_dir2_block_leaf_p(). Dave provided a metadump image which led to a simple reproducer (create a particular filename in the affected directory) and this resolves the testcase as well as the bug on his live system. Thanks also to dchinner for looking at this one with me. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Tested-by: Dave Jones <davej@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-13 14:36:17 -06:00
Theodore Ts'o	7f5118629f	ext4: trigger the lazy inode table initialization after resize After we have finished extending the file system, we need to trigger the lazy inode table thread to zero out the inode tables. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-01-13 08:41:45 -05:00
Eryu Guan	15b49132fc	ext4: check bh in ext4_read_block_bitmap() Validate the bh pointer before using it, since ext4_read_block_bitmap_nowait() might return NULL. I've seen this in fsfuzz testing. EXT4-fs error (device loop0): ext4_read_block_bitmap_nowait:385: comm touch: Cannot get buffer for block bitmap - block_group = 0, block_bitmap = 3925999616 BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8121de25>] ext4_wait_block_bitmap+0x25/0xe0 ... Call Trace: [<ffffffff8121e1e5>] ext4_read_block_bitmap+0x35/0x60 [<ffffffff8125e9c6>] ext4_free_blocks+0x236/0xb80 [<ffffffff811d0d36>] ? __getblk+0x36/0x70 [<ffffffff811d0a5f>] ? __find_get_block+0x8f/0x210 [<ffffffff81191ef3>] ? kmem_cache_free+0x33/0x140 [<ffffffff812678e5>] ext4_xattr_release_block+0x1b5/0x1d0 [<ffffffff812679be>] ext4_xattr_delete_inode+0xbe/0x100 [<ffffffff81222a7c>] ext4_free_inode+0x7c/0x4d0 [<ffffffff812277b8>] ? ext4_mark_inode_dirty+0x88/0x230 [<ffffffff8122993c>] ext4_evict_inode+0x32c/0x490 [<ffffffff811b8cd7>] evict+0xa7/0x1c0 [<ffffffff811b8ed3>] iput_final+0xe3/0x170 [<ffffffff811b8f9e>] iput+0x3e/0x50 [<ffffffff812316fd>] ext4_add_nondir+0x4d/0x90 [<ffffffff81231d0b>] ext4_create+0xeb/0x170 [<ffffffff811aae9c>] vfs_create+0xac/0xd0 [<ffffffff811ac845>] lookup_open+0x185/0x1c0 [<ffffffff8129e3b9>] ? selinux_inode_permission+0xa9/0x170 [<ffffffff811acb54>] do_last+0x2d4/0x7a0 [<ffffffff811af743>] path_openat+0xb3/0x480 [<ffffffff8116a8a1>] ? handle_mm_fault+0x251/0x3b0 [<ffffffff811afc49>] do_filp_open+0x49/0xa0 [<ffffffff811bbaad>] ? __alloc_fd+0xdd/0x150 [<ffffffff8119da28>] do_sys_open+0x108/0x1f0 [<ffffffff8119db51>] sys_open+0x21/0x30 [<ffffffff81618959>] system_call_fastpath+0x16/0x1b Also fix comment for ext4_read_block_bitmap_nowait() Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2013-01-12 16:33:25 -05:00
Wang Shilong	aebf02430d	ext4: use unlikely to improve the efficiency of the kernel Because 'sb_getblk' seldom fails (returns NULL), it is better to use 'unlikely' to optimize the check. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2013-01-12 16:28:47 -05:00
Theodore Ts'o	860d21e2c5	ext4: return ENOMEM if sb_getblk() fails The only reason for sb_getblk() failing is if it can't allocate the buffer_head. So ENOMEM is more appropriate than EIO. In addition, make sure that the file system is marked as being inconsistent if sb_getblk() fails. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2013-01-12 16:19:36 -05:00
Miao Xie	10ee27a06c	vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them writeback_inodes_sb(_nr)_if_idle() is re-implemented by replacing down_read() with down_read_trylock() because - If ->s_umount is write locked, then the sb is not idle, so writeback_inodes_sb(_nr)_if_idle() needn't wait for the lock. - writeback_inodes_sb(_nr)_if_idle() grabs the s_umount lock when it wants to start writeback, which may cause a deadlock when doing umount. To fix that problem, ext4 and btrfs implemented their own writeback functions instead of using writeback_inodes_sb(_nr)_if_idle(), but that introduced redundant code; it is better to implement a new writeback_inodes_sb(_nr)_if_idle(). The names of these two functions are cumbersome, so rename them to try_to_writeback_inodes_sb(_nr). This idea came from Christoph Hellwig. Some code is from the patch of Kamal Mostafa. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>	2013-01-12 10:47:43 +08:00
Xi Wang	6d92d4f6a7	fs/exec.c: work around icc miscompilation The tricky problem is this check: if (i++ >= max) icc (mis)optimizes this check as: if (++i > max) The check now becomes a no-op since max is MAX_ARG_STRINGS (0x7FFFFFFF). This is "allowed" by the C standard, assuming i++ never overflows, because signed integer overflow is undefined behavior. This optimization effectively reverts the previous commit `362e6663ef` ("exec.c, compat.c: fix count(), compat_count() bounds checking") that tries to fix the check. This patch simply moves ++ after the check. Signed-off-by: Xi Wang <xi.wang@gmail.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-01-11 14:54:55 -08:00
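The fix described in the commit above (test the bound first, increment afterwards) can be sketched in plain C. This is a userspace illustration, not the code from fs/exec.c; `count_strings` and its return convention are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the entries of a NULL-terminated string array, refusing to go
 * past max. The bound is tested before the increment, so i never
 * exceeds max and the signed counter cannot overflow, which leaves the
 * compiler no undefined behavior to exploit.
 * Returns the count, or -1 if the array holds more than max entries.
 */
static int count_strings(const char *const *argv, int max)
{
	int i = 0;

	assert(max >= 0);
	if (argv == NULL)
		return 0;

	while (argv[i] != NULL) {
		if (i >= max)
			return -1;
		++i;
	}
	return i;
}
```

Writing the test as `if (i++ >= max)` is the form the commit message reports as being miscompiled; splitting the comparison from the increment keeps the check meaningful even when max is INT_MAX.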
Kees Cook	d9777b8de4	fs/xfs: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. CC: Ben Myers <bpm@sgi.com> CC: Alex Elder <elder@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Ben Myers <bpm@sgi.com>	2013-01-11 11:39:04 -08:00
Kees Cook	f11cb2271f	fs/nilfs2: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. CC: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2013-01-11 11:39:04 -08:00
Kees Cook	336d6d0323	fs/ecryptfs: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. CC: Tyler Hicks <tyhicks@canonical.com> CC: Dustin Kirkland <dustin.kirkland@gazzang.com> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Tyler Hicks <tyhicks@canonical.com>	2013-01-11 11:39:04 -08:00
Kees Cook	1b6a78a522	fs/ceph: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. CC: Sage Weil <sage@inktank.com> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Sage Weil <sage@inktank.com>	2013-01-11 11:39:04 -08:00
Seiji Aguchi	9f244e9cfd	pstore: Avoid deadlock in panic and emergency-restart path [Issue] When pstore is in panic and emergency-restart paths, it may be blocked in those paths because it simply takes spin_lock. This is an example scenario in which pstore may hang in a panic path: - cpuA grabs psinfo->buf_lock - cpuB panics and calls smp_send_stop - smp_send_stop sends IRQ to cpuA - after 1 second, cpuB gives up on cpuA and sends an NMI instead - cpuA is now in an NMI handler while still holding buf_lock - cpuB is deadlocked This case may happen if the firmware has a bug and cpuA is stuck talking with it for more than one second. Also, this is a similar scenario in an emergency-restart path: - cpuA grabs psinfo->buf_lock and gets stuck in firmware - cpuB kicks emergency-restart via either sysrq-b or hangcheck timer. And then, cpuB is deadlocked by taking psinfo->buf_lock again. [Solution] This patch avoids the deadlocking issues in both panic and emergency_restart paths by introducing a function, is_non_blocking_path(), to check if a cpu can be blocked in the current path. With this patch, pstore is not blocked in those paths even if another cpu has taken the spin_lock, because it changes from spin_lock_irqsave to spin_trylock_irqsave. In addition, according to a comment of emergency_restart() in kernel/sys.c, spin_lock shouldn't be taken in an emergency_restart path to avoid deadlock. This patch fits the comment below. <snip> /** * emergency_restart - reboot the system * * Without shutting down any hardware or taking any locks * reboot the system. This is called when we know we are in * trouble so this is our best effort to reboot. This is * safe to call in interrupt context. */ void emergency_restart(void) <snip> Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com> Acked-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2013-01-11 10:20:50 -08:00
Dave Reisner	f1688e0431	debugfs: convert gid= argument from decimal, not octal This patch technically breaks userspace, but I suspect that anyone who actually used this flag would have encountered this brokenness, declared it lunacy, and already sent a patch. Signed-off-by: Dave Reisner <dreisner@archlinux.org> Reviewed-by: Vasiliy Kulikov <segoon@openwall.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-11 05:56:01 -08:00
Jaegeuk Kim	9eaeba7013	f2fs: move f2fs_balance_fs to punch_hole The f2fs_fallocate() has two operations: punch_hole and expand_size. Only in the case of punch_hole, dirty node pages can be produced, so let's trigger f2fs_balance_fs() in this case only. Furthermore, let's trigger it at every data truncation routine. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-01-11 15:09:23 +09:00
Jaegeuk Kim	7d82db8316	f2fs: add f2fs_balance_fs in several interfaces The f2fs_balance_fs() is to check the number of free sections and decide whether it needs to conduct cleaning or not. If there are not enough free sections, the cleaning job should be started. In order to control an amount of free sections even under high utilization, f2fs should call f2fs_balance_fs at all the VFS interfaces that are able to produce dirty pages. This patch adds the function calls in the missing interfaces as follows. 1. f2fs_setxattr() The f2fs_setxattr() produces dirty node pages so that we should call f2fs_balance_fs() either likewise doing in other VFS interfaces such as f2fs_lookup(), f2fs_mkdir(), and so on. 2. f2fs_sync_file() We should guarantee serving free sections for syncing metadata during fsync. Previously, there is no space check before triggering checkpoint and sync_node_pages. Therefore, if a bunch of fsync calls are triggered under 100% of FS utilization, f2fs is able to be faced with no free sections, resulting in BUG_ON(). 3. f2fs_sync_fs() Before calling write_checkpoint(), we should guarantee that there are minimum free sections. 4. f2fs_write_inode() f2fs_write_inode() is also able to produce dirty node pages. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-01-11 15:09:17 +09:00
Randy Dunlap	254adaa465	seq_file: fix new kernel-doc warnings Fix kernel-doc warnings in fs/seq_file.c: Warning(fs/seq_file.c:304): No description found for parameter 'whence' Warning(fs/seq_file.c:304): Excess function parameter 'origin' description in 'seq_lseek' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-01-10 14:35:24 -08:00
Jaegeuk Kim	408e937561	f2fs: revisit the f2fs_gc flow I'd like to revisit the f2fs_gc flow and rewrite as follows. 1. In practical, the nGC parameter of f2fs_gc is meaningless. So, let's remove it. 2. Background GC marks victim blocks as dirty one at a time. 3. Foreground GC should do cleaning job until acquiring enough free sections. Afterwards, it needs to do checkpoint. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-01-10 07:42:59 +09:00

30831 Commits