Skip to content

Commit

Permalink
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kern…
Browse files Browse the repository at this point in the history
…el/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "In addition to bug fixes and cleanups, there are two new features for
  ext4 in 5.14:

   - Allow applications to poll on changes to
     /sys/fs/ext4/*/errors_count

   - Add the ioctl EXT4_IOC_CHECKPOINT which allows the journal to be
     checkpointed, truncated and discarded or zero'ed"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits)
  jbd2: export jbd2_journal_[un]register_shrinker()
  ext4: notify sysfs on errors_count value change
  fs: remove bdev_try_to_free_page callback
  ext4: remove bdev_try_to_free_page() callback
  jbd2: simplify journal_clean_one_cp_list()
  jbd2,ext4: add a shrinker to release checkpointed buffers
  jbd2: remove redundant buffer io error checks
  jbd2: don't abort the journal when freeing buffers
  jbd2: ensure abort the journal if detect IO error when writing original buffer back
  jbd2: remove the out label in __jbd2_journal_remove_checkpoint()
  ext4: no need to verify new add extent block
  jbd2: clean up misleading comments for jbd2_fc_release_bufs
  ext4: add check to prevent attempting to resize an fs with sparse_super2
  ext4: consolidate checks for resize of bigalloc into ext4_resize_begin
  ext4: remove duplicate definition of ext4_xattr_ibody_inline_set()
  ext4: fsmap: fix the block/inode bitmap comment
  ext4: fix comment for s_hash_unsigned
  ext4: use local variable ei instead of EXT4_I() macro
  ext4: fix avefreec in find_group_orlov
  ext4: correct the cache_nr in tracepoint ext4_es_shrink_exit
  ...
  • Loading branch information
Linus Torvalds committed Jul 1, 2021
2 parents 2cfa582 + 16aa4c9 commit a6ecc2a
Show file tree
Hide file tree
Showing 25 changed files with 720 additions and 215 deletions.
39 changes: 31 additions & 8 deletions Documentation/filesystems/ext4/journal.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,14 +4,14 @@ Journal (jbd2)
--------------

Introduced in ext3, the ext4 filesystem employs a journal to protect the
filesystem against corruption in the case of a system crash. A small
continuous region of disk (default 128MiB) is reserved inside the
filesystem as a place to land “important” data writes on-disk as quickly
as possible. Once the important data transaction is fully written to the
disk and flushed from the disk write cache, a record of the data being
committed is also written to the journal. At some later point in time,
the journal code writes the transactions to their final locations on
disk (this could involve a lot of seeking or a lot of small
filesystem against metadata inconsistencies in the case of a system crash. Up
to 10,240,000 file system blocks (see man mke2fs(8) for more details on journal
size limits) can be reserved inside the filesystem as a place to land
important data writes on-disk as quickly as possible. Once the important
data transaction is fully written to the disk and flushed from the disk write
cache, a record of the data being committed is also written to the journal. At
some later point in time, the journal code writes the transactions to their
final locations on disk (this could involve a lot of seeking or a lot of small
read-write-erases) before erasing the commit record. Should the system
crash during the second slow write, the journal can be replayed all the
way to the latest commit record, guaranteeing the atomicity of whatever
Expand Down Expand Up @@ -731,3 +731,26 @@ point, the refcount for inode 11 is not reliable, but that gets fixed by the
replay of last inode 11 tag. Thus, by converting a non-idempotent procedure
into a series of idempotent outcomes, fast commits ensured idempotence during
the replay.

Journal Checkpoint
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Checkpointing the journal ensures all transactions and their associated buffers
are submitted to the disk. In-progress transactions are waited upon and included
in the checkpoint. Checkpointing is used internally during critical updates to
the filesystem including journal recovery, filesystem resizing, and freeing of
the journal_t structure.

A journal checkpoint can be triggered from userspace via the ioctl
EXT4_IOC_CHECKPOINT. This ioctl takes a single, u64 argument for flags.
Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
can be used to verify input to the ioctl. It returns error if there is any
invalid input, otherwise it returns success without performing
any checkpointing. This can be used to check whether the ioctl exists on a
system and to verify there are no issues with arguments or flags. The
other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be
discarded or zero-filled, respectively, after the journal checkpoint is
complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
cannot both be set. The ioctl may be useful when snapshotting a system or for
complying with content deletion SLOs.
15 changes: 0 additions & 15 deletions fs/block_dev.c
Original file line number Diff line number Diff line change
Expand Up @@ -1673,20 +1673,6 @@ ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
}
EXPORT_SYMBOL_GPL(blkdev_read_iter);

/*
* Try to release a page associated with block device when the system
* is under memory pressure.
*/
static int blkdev_releasepage(struct page *page, gfp_t wait)
{
struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;

if (super && super->s_op->bdev_try_to_free_page)
return super->s_op->bdev_try_to_free_page(super, page, wait);

return try_to_free_buffers(page);
}

static int blkdev_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
Expand All @@ -1701,7 +1687,6 @@ static const struct address_space_operations def_blk_aops = {
.write_begin = blkdev_write_begin,
.write_end = blkdev_write_end,
.writepages = blkdev_writepages,
.releasepage = blkdev_releasepage,
.direct_IO = blkdev_direct_IO,
.migratepage = buffer_migrate_page_norefs,
.is_dirty_writeback = buffer_check_dirty_writeback,
Expand Down
18 changes: 16 additions & 2 deletions fs/ext4/ext4.h
Original file line number Diff line number Diff line change
Expand Up @@ -720,6 +720,7 @@ enum {
#define EXT4_IOC_CLEAR_ES_CACHE _IO('f', 40)
#define EXT4_IOC_GETSTATE _IOW('f', 41, __u32)
#define EXT4_IOC_GET_ES_CACHE _IOWR('f', 42, struct fiemap)
#define EXT4_IOC_CHECKPOINT _IOW('f', 43, __u32)

#define EXT4_IOC_SHUTDOWN _IOR ('X', 125, __u32)

Expand All @@ -741,6 +742,14 @@ enum {
#define EXT4_STATE_FLAG_NEWENTRY 0x00000004
#define EXT4_STATE_FLAG_DA_ALLOC_CLOSE 0x00000008

/* flags for ioctl EXT4_IOC_CHECKPOINT */
#define EXT4_IOC_CHECKPOINT_FLAG_DISCARD 0x1
#define EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT 0x2
#define EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN 0x4
#define EXT4_IOC_CHECKPOINT_FLAG_VALID (EXT4_IOC_CHECKPOINT_FLAG_DISCARD | \
EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT | \
EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN)

#if defined(__KERNEL__) && defined(CONFIG_COMPAT)
/*
* ioctl commands in 32 bit emulation
Expand Down Expand Up @@ -1477,7 +1486,7 @@ struct ext4_sb_info {
unsigned int s_inode_goal;
u32 s_hash_seed[4];
int s_def_hash_version;
int s_hash_unsigned; /* 3 if hash should be signed, 0 if not */
int s_hash_unsigned; /* 3 if hash should be unsigned, 0 if not */
struct percpu_counter s_freeclusters_counter;
struct percpu_counter s_freeinodes_counter;
struct percpu_counter s_dirs_counter;
Expand All @@ -1488,6 +1497,7 @@ struct ext4_sb_info {
struct kobject s_kobj;
struct completion s_kobj_unregister;
struct super_block *s_sb;
struct buffer_head *s_mmp_bh;

/* Journaling */
struct journal_s *s_journal;
Expand Down Expand Up @@ -3614,6 +3624,7 @@ extern const struct inode_operations ext4_symlink_inode_operations;
extern const struct inode_operations ext4_fast_symlink_inode_operations;

/* sysfs.c */
extern void ext4_notify_error_sysfs(struct ext4_sb_info *sbi);
extern int ext4_register_sysfs(struct super_block *sb);
extern void ext4_unregister_sysfs(struct super_block *sb);
extern int __init ext4_init_sysfs(void);
Expand Down Expand Up @@ -3720,6 +3731,9 @@ extern struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end);
/* mmp.c */
extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);

/* mmp.c */
extern void ext4_stop_mmpd(struct ext4_sb_info *sbi);

/* verity.c */
extern const struct fsverity_operations ext4_verityops;

Expand Down Expand Up @@ -3784,7 +3798,7 @@ static inline int ext4_buffer_uptodate(struct buffer_head *bh)
* have to read the block because we may read the old data
* successfully.
*/
if (!buffer_uptodate(bh) && buffer_write_io_error(bh))
if (buffer_write_io_error(bh))
set_buffer_uptodate(bh);
return buffer_uptodate(bh);
}
Expand Down
4 changes: 4 additions & 0 deletions fs/ext4/extents.c
Original file line number Diff line number Diff line change
Expand Up @@ -825,6 +825,7 @@ void ext4_ext_tree_init(handle_t *handle, struct inode *inode)
eh->eh_entries = 0;
eh->eh_magic = EXT4_EXT_MAGIC;
eh->eh_max = cpu_to_le16(ext4_ext_space_root(inode, 0));
eh->eh_generation = 0;
ext4_mark_inode_dirty(handle, inode);
}

Expand Down Expand Up @@ -1090,6 +1091,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
neh->eh_max = cpu_to_le16(ext4_ext_space_block(inode, 0));
neh->eh_magic = EXT4_EXT_MAGIC;
neh->eh_depth = 0;
neh->eh_generation = 0;

/* move remainder of path[depth] to the new leaf */
if (unlikely(path[depth].p_hdr->eh_entries !=
Expand Down Expand Up @@ -1167,6 +1169,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
neh->eh_magic = EXT4_EXT_MAGIC;
neh->eh_max = cpu_to_le16(ext4_ext_space_block_idx(inode, 0));
neh->eh_depth = cpu_to_le16(depth - i);
neh->eh_generation = 0;
fidx = EXT_FIRST_INDEX(neh);
fidx->ei_block = border;
ext4_idx_store_pblock(fidx, oldblock);
Expand Down Expand Up @@ -1306,6 +1309,7 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode,
neh->eh_magic = EXT4_EXT_MAGIC;
ext4_extent_block_csum_set(inode, neh);
set_buffer_uptodate(bh);
set_buffer_verified(bh);
unlock_buffer(bh);

err = ext4_handle_dirty_metadata(handle, inode, bh);
Expand Down
4 changes: 1 addition & 3 deletions fs/ext4/extents_status.c
Original file line number Diff line number Diff line change
Expand Up @@ -1574,11 +1574,9 @@ static unsigned long ext4_es_scan(struct shrinker *shrink,
ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
trace_ext4_es_shrink_scan_enter(sbi->s_sb, nr_to_scan, ret);

if (!nr_to_scan)
return ret;

nr_shrunk = __es_shrink(sbi, nr_to_scan, NULL);

ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
trace_ext4_es_shrink_scan_exit(sbi->s_sb, nr_shrunk, ret);
return nr_shrunk;
}
Expand Down
4 changes: 2 additions & 2 deletions fs/ext4/fsmap.h
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ int ext4_getfsmap(struct super_block *sb, struct ext4_fsmap_head *head,
#define EXT4_FMR_OWN_INODES FMR_OWNER('X', 5) /* inodes */
#define EXT4_FMR_OWN_GDT FMR_OWNER('f', 1) /* group descriptors */
#define EXT4_FMR_OWN_RESV_GDT FMR_OWNER('f', 2) /* reserved gdt blocks */
#define EXT4_FMR_OWN_BLKBM FMR_OWNER('f', 3) /* inode bitmap */
#define EXT4_FMR_OWN_INOBM FMR_OWNER('f', 4) /* block bitmap */
#define EXT4_FMR_OWN_BLKBM FMR_OWNER('f', 3) /* block bitmap */
#define EXT4_FMR_OWN_INOBM FMR_OWNER('f', 4) /* inode bitmap */

#endif /* __EXT4_FSMAP_H__ */
11 changes: 5 additions & 6 deletions fs/ext4/ialloc.c
Original file line number Diff line number Diff line change
Expand Up @@ -402,7 +402,7 @@ static void get_orlov_stats(struct super_block *sb, ext4_group_t g,
*
* We always try to spread first-level directories.
*
* If there are blockgroups with both free inodes and free blocks counts
* If there are blockgroups with both free inodes and free clusters counts
* not worse than average we return one with smallest directory count.
* Otherwise we simply return a random group.
*
Expand All @@ -411,7 +411,7 @@ static void get_orlov_stats(struct super_block *sb, ext4_group_t g,
* It's OK to put directory into a group unless
* it has too many directories already (max_dirs) or
* it has too few free inodes left (min_inodes) or
* it has too few free blocks left (min_blocks) or
* it has too few free clusters left (min_clusters) or
* Parent's group is preferred, if it doesn't satisfy these
* conditions we search cyclically through the rest. If none
* of the groups look good we just look for a group with more
Expand All @@ -427,7 +427,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
ext4_group_t real_ngroups = ext4_get_groups_count(sb);
int inodes_per_group = EXT4_INODES_PER_GROUP(sb);
unsigned int freei, avefreei, grp_free;
ext4_fsblk_t freeb, avefreec;
ext4_fsblk_t freec, avefreec;
unsigned int ndirs;
int max_dirs, min_inodes;
ext4_grpblk_t min_clusters;
Expand All @@ -446,9 +446,8 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,

freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
avefreei = freei / ngroups;
freeb = EXT4_C2B(sbi,
percpu_counter_read_positive(&sbi->s_freeclusters_counter));
avefreec = freeb;
freec = percpu_counter_read_positive(&sbi->s_freeclusters_counter);
avefreec = freec;
do_div(avefreec, ngroups);
ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);

Expand Down
11 changes: 5 additions & 6 deletions fs/ext4/inline.c
Original file line number Diff line number Diff line change
Expand Up @@ -204,7 +204,7 @@ static int ext4_read_inline_data(struct inode *inode, void *buffer,
/*
* write the buffer to the inline inode.
* If 'create' is set, we don't need to do the extra copy in the xattr
* value since it is already handled by ext4_xattr_ibody_inline_set.
* value since it is already handled by ext4_xattr_ibody_set.
* That saves us one memcpy.
*/
static void ext4_write_inline_data(struct inode *inode, struct ext4_iloc *iloc,
Expand Down Expand Up @@ -286,7 +286,7 @@ static int ext4_create_inline_data(handle_t *handle,

BUG_ON(!is.s.not_found);

error = ext4_xattr_ibody_inline_set(handle, inode, &i, &is);
error = ext4_xattr_ibody_set(handle, inode, &i, &is);
if (error) {
if (error == -ENOSPC)
ext4_clear_inode_state(inode,
Expand Down Expand Up @@ -358,7 +358,7 @@ static int ext4_update_inline_data(handle_t *handle, struct inode *inode,
i.value = value;
i.value_len = len;

error = ext4_xattr_ibody_inline_set(handle, inode, &i, &is);
error = ext4_xattr_ibody_set(handle, inode, &i, &is);
if (error)
goto out;

Expand Down Expand Up @@ -431,7 +431,7 @@ static int ext4_destroy_inline_data_nolock(handle_t *handle,
if (error)
goto out;

error = ext4_xattr_ibody_inline_set(handle, inode, &i, &is);
error = ext4_xattr_ibody_set(handle, inode, &i, &is);
if (error)
goto out;

Expand Down Expand Up @@ -1925,8 +1925,7 @@ int ext4_inline_data_truncate(struct inode *inode, int *has_inline)
i.value = value;
i.value_len = i_size > EXT4_MIN_INLINE_DATA_SIZE ?
i_size - EXT4_MIN_INLINE_DATA_SIZE : 0;
err = ext4_xattr_ibody_inline_set(handle, inode,
&i, &is);
err = ext4_xattr_ibody_set(handle, inode, &i, &is);
if (err)
goto out_error;
}
Expand Down
8 changes: 4 additions & 4 deletions fs/ext4/inode.c
Original file line number Diff line number Diff line change
Expand Up @@ -374,7 +374,7 @@ void ext4_da_update_reserve_space(struct inode *inode,
ei->i_reserved_data_blocks -= used;
percpu_counter_sub(&sbi->s_dirtyclusters_counter, used);

spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
spin_unlock(&ei->i_block_reservation_lock);

/* Update quota subsystem for data blocks */
if (quota_claim)
Expand Down Expand Up @@ -3223,7 +3223,7 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
ext4_clear_inode_state(inode, EXT4_STATE_JDATA);
journal = EXT4_JOURNAL(inode);
jbd2_journal_lock_updates(journal);
err = jbd2_journal_flush(journal);
err = jbd2_journal_flush(journal, 0);
jbd2_journal_unlock_updates(journal);

if (err)
Expand Down Expand Up @@ -3418,7 +3418,7 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
* i_disksize out to i_size. This could be beyond where direct I/O is
* happening and thus expose allocated blocks to direct I/O reads.
*/
else if ((map->m_lblk * (1 << blkbits)) >= i_size_read(inode))
else if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
m_flags = EXT4_GET_BLOCKS_CREATE;
else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
Expand Down Expand Up @@ -6005,7 +6005,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
if (val)
ext4_set_inode_flag(inode, EXT4_INODE_JOURNAL_DATA);
else {
err = jbd2_journal_flush(journal);
err = jbd2_journal_flush(journal, 0);
if (err < 0) {
jbd2_journal_unlock_updates(journal);
percpu_up_write(&sbi->s_writepages_rwsem);
Expand Down
Loading

0 comments on commit a6ecc2a

Please sign in to comment.